CN113610182A

CN113610182A - User electricity consumption behavior clustering analysis method, system and storage medium

Info

Publication number: CN113610182A
Application number: CN202110952732.0A
Authority: CN
Inventors: 王秀茹; 邱冬; 韩少华; 毛王清; 庞吉年; 葛萱; 刘刚; 王云杰; 贺国梁
Original assignee: State Grid Jiangsu Electric Power Co ltd Suqian Power Supply Branch
Current assignee: State Grid Jiangsu Electric Power Co ltd Suqian Power Supply Branch
Priority date: 2021-08-19
Filing date: 2021-08-19
Publication date: 2021-11-05

Abstract

The invention discloses a user electricity consumption behavior clustering analysis method, system and storage medium. The standard mRMR method is improved by introducing a weight factor into the standard mRMR criterion to refine the measurement of feature correlation and redundancy, and the k-means clustering algorithm is improved by using a maximum-minimum distance algorithm to select the cluster centers. A feature selection method based on the improved criterion then selects independent and effective electricity consumption features to construct a feature set, and the improved k-means clustering algorithm is used to analyze user electricity consumption behavior, achieving dimension reduction of the user electricity consumption data. The method has high accuracy and greatly improves computational efficiency.

Description

User electricity consumption behavior clustering analysis method, system and storage medium

Technical Field

The invention relates to technology for analyzing the electricity consumption behavior of power users, and in particular to a cluster analysis method for user electricity consumption behavior.

Background

Analysis of users' electricity consumption behavior underlies many tasks, such as user load management and scheduling, energy-efficiency improvement and retrofitting for power users, and implementation of demand response strategies on the grid side. User-side big data is the data embodiment of users' electricity consumption behavior, so valuable information about that behavior can be mined from large volumes of user-side data with a suitable data mining method. The k-means clustering algorithm measures data similarity and partitions data effectively and is easy to implement, and it is widely used for mining and analyzing electricity consumption data in the field of intelligent power utilization.

In recent years, many studies have performed cluster analysis of user electricity consumption behavior using load characteristic indexes as the electricity consumption features. In today's complex electricity consumption environment, electrical equipment of all kinds is increasingly widespread and heavily used, and the electricity consumption behavior of power users is diverse and complex; a fixed feature set cannot serve the analysis of all power users' electricity consumption behavior and therefore lacks generality. Meanwhile, because electricity consumption behavior features are closely tied to users' consumption habits, a feature set used for behavior analysis inevitably contains many redundant and irrelevant features. These features increase the complexity and running time of the algorithm, bring about the curse of dimensionality, and reduce the accuracy of the model. At present, when load characteristic indexes are used for cluster analysis in place of the original load curve data, only the common load characteristic indexes (load rate, daily peak-valley difference rate, peak-period load rate, flat-period load rate and valley-period load rate) are used as electricity consumption features, and these features are neither analyzed nor optimally selected. Such methods are therefore limited and lack generality for analyzing user electricity consumption behavior.

Disclosure of Invention

The technical problem to be solved by the invention is to overcome the shortcomings of the prior art by providing a user electricity consumption behavior clustering analysis method, system and storage medium that increase the correlation of the user electricity consumption feature set, reduce its redundancy, and improve the accuracy of the clustering results.

In order to solve the technical problems, the technical scheme adopted by the invention is as follows: a user electricity consumption behavior cluster analysis method comprises the following steps:

S1, clustering the original load data of the users to obtain the original user cluster grouping information; clustering the user data calculated from a candidate feature subset to obtain the candidate user cluster grouping information;

S2, comparing the candidate user cluster grouping information with the original user cluster grouping information to obtain the number of correctly grouped users, and calculating the clustering accuracy according to the formula

$$A(S) = \frac{\text{number of correctly grouped users}}{\text{total number of users}};$$

S3, taking the next candidate feature subset and returning to step S2 until the clustering accuracy of all candidate feature subsets has been obtained;

S4, recording the maximum value of the clustering accuracy over all candidate feature subsets and the candidate feature subset corresponding to that maximum, this candidate feature subset being the reduced feature subset.

The method quickly and accurately obtains the required reduced feature subset from a plurality of candidate feature subsets, making the clustering result more accurate. The quality of the clustering result of each candidate feature subset is measured by an index based on the real clustering result, namely the clustering accuracy, and the candidate feature subset with the maximum clustering accuracy is the reduced feature subset. The number of correctly grouped users used in calculating the clustering accuracy is obtained by comparing the candidate user cluster grouping information with the original user cluster grouping information; the operation is simple and the result is clear. The reduced feature subset contains no redundant or irrelevant features.

In step S1, the specific process of acquiring the original user cluster grouping information includes:

1) giving the original load data of the users, and giving the maximum value k_max and the minimum value k_min of the number k of typical user electricity consumption behavior categories, where k_min = 2 and N is the total number of original user load data samples;

2) taking k_min as the initial value of k, finding the cluster centers of typical user electricity consumption behavior with the k-means method, and calculating the similarity W of all users within the typical electricity consumption behavior categories, W = Intra(k) + (1 − Inter(k)/Inter(k));

3) judging whether k is greater than k_max; if k is less than k_max, adding 1 to k and returning to step 2); otherwise, proceeding to step 4);

4) taking the value of k corresponding to the minimum of W as k_best, the optimal number of clusters of typical user electricity consumption behavior categories;

5) according to the determined optimal number of clusters, determining the k_best cluster centers with the maximum-minimum distance algorithm, and finally assigning the original load data of each user to the typical user electricity consumption behavior categories by the minimum-distance principle to obtain the original user cluster grouping information.

The invention addresses two problems of clustering the original users with the traditional k-means algorithm: the cluster centers of typical user electricity consumption behavior are selected randomly, and the number of typical behavior categories must be preset. Both can make the original user cluster grouping result inaccurate. The maximum and minimum values of the number k of typical behavior categories are given, a preliminary calculation is carried out with the similarity function W between data clusters, and the k value at which W is minimal is taken as the optimal k, so the number of categories no longer needs to be given in advance. The initial cluster centers are selected with the maximum-minimum distance algorithm rather than at random, which greatly improves the accuracy of the original user cluster grouping result.
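The center selection and minimum-distance assignment described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the toy data, and the choice of the first center (the sample farthest from the overall mean) are assumptions, since the text does not say how the first center is picked.

```python
import numpy as np

def max_min_centers(data, k):
    """Pick k initial centers with the maximum-minimum distance rule."""
    data = np.asarray(data, dtype=float)
    # Assumption: the first center is the sample farthest from the overall mean.
    first = int(np.argmax(np.linalg.norm(data - data.mean(axis=0), axis=1)))
    chosen = [first]
    # Distance from every sample to its nearest already-chosen center.
    nearest = np.linalg.norm(data - data[first], axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(nearest))  # sample whose nearest center is farthest away
        chosen.append(nxt)
        nearest = np.minimum(nearest, np.linalg.norm(data - data[nxt], axis=1))
    return data[chosen]

def assign_by_min_distance(data, centers):
    """Assign each sample to the closest center (minimum-distance principle)."""
    data = np.asarray(data, dtype=float)
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

# Hypothetical toy "load data": three well-separated pairs of users.
loads = np.array([[0.0, 0.1], [0.1, 0.0],
                  [5.0, 5.1], [5.1, 5.0],
                  [10.0, 0.0], [10.2, 0.3]])
centers = max_min_centers(loads, 3)
labels = assign_by_min_distance(loads, centers)
```

Because each new center is the sample farthest from all centers chosen so far, the initial centers spread across the data instead of depending on a random draw.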

In step S2, the number of correctly grouped users is determined as follows: for the 1st group of users in the original user cluster grouping information, count how many of its users fall into each group of the candidate user cluster grouping information; if the candidate group that receives the most of these users receives P of them, then P is the number of accurately classified users in the 1st group. The same is repeated for every group in the original user cluster grouping information until the number of accurately classified users in all groups has been obtained. This matching maximizes the number of users counted as accurately clustered, and thus yields the maximum clustering accuracy of the current clustering.
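The counting rule above can be sketched in a few lines. The function names and the toy group labels are illustrative assumptions; the accuracy is taken as the number of accurately classified users divided by the total number of users, as stated later in the description.

```python
from collections import Counter

def count_correct_users(original_groups, candidate_groups):
    """Sum, over original groups, the size of the largest overlap with any candidate group."""
    if len(original_groups) != len(candidate_groups):
        raise ValueError("both groupings must cover the same users")
    overlaps = {}
    for orig, cand in zip(original_groups, candidate_groups):
        overlaps.setdefault(orig, Counter())[cand] += 1
    return sum(max(counter.values()) for counter in overlaps.values())

def clustering_accuracy(original_groups, candidate_groups):
    """Correctly grouped users divided by the total number of users."""
    return count_correct_users(original_groups, candidate_groups) / len(original_groups)

# Hypothetical groupings of 10 users (group ids are arbitrary labels).
original = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
candidate = [1, 1, 1, 0, 0, 0, 2, 2, 2, 2]
correct = count_correct_users(original, candidate)   # 3 + 2 + 3
accuracy = clustering_accuracy(original, candidate)
```

Here original group 0 contributes 3 users, group 1 contributes 2, and group 2 contributes 3, so 8 of the 10 users count as correctly grouped.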

The acquisition process of the candidate feature subset comprises the following steps:

A1, constructing a user electricity consumption feature set;

A2, using an incremental search algorithm based on the maximum correlation minimum redundancy criterion to obtain, for each value of the weight factor α_i, a set of nested candidate feature subsets indexed by j, 1 ≤ j ≤ N, each of which is a subset of the user electricity consumption feature set; j is the feature number; N is the total number of features; 1 ≤ i ≤ M, where M is the number of weight factors; the weight factor α_i takes values in the interval [0, 1].

The candidate feature subsets are obtained simply and quickly; they contain the most valuable information in the user electricity consumption feature set while remaining compact. The candidate feature subsets are obtained under the maximum correlation minimum redundancy criterion, which ensures that the correlation between the candidate feature set and the user category is maximal while the redundancy among the features within each subset is minimal, so that redundant and irrelevant user electricity consumption features are eliminated from the subset.

The specific implementation process of the step a2 includes:

1) let Q = X and let S be the empty set, where X is the user electricity consumption feature set;

2) let i = 1;

3) calculating the mutual information I(x_i; c), which measures the correlation between the i-th user electricity consumption feature x_i and the target user electricity consumption behavior category c; finding the feature that satisfies max[I(x_i; c)] and taking it as the first selected feature, so that S_1 contains this feature and Q_1 is Q with this feature removed;

4) let x_i ∈ Q_{m-1} and let the j-th user electricity consumption feature x_j ∈ S_{m-1}, m = 2, …, N; searching Q_{m-1} for the feature that maximizes the value of the improved maximum correlation minimum redundancy criterion, adding it to S_{m-1} to obtain S_m and removing it from Q_{m-1} to obtain Q_m; putting the obtained candidate feature subsets S_{m-1}, …, S_N into the alternative feature set S;

5) adding 1 to i and returning to step 2) until the set Q is empty; the alternative feature set S so obtained is the candidate feature set. The candidate feature subsets in S are arranged in descending order of the criterion value, giving N candidate feature subsets (as many subsets as there are features). The candidate feature subsets are nested: each contains all the features of the preceding subset plus one additional feature.

Introducing the weight factor α_i improves the standard maximum correlation minimum redundancy criterion so that the weights of correlation and redundancy of the user electricity consumption features can be set explicitly, and the incremental search algorithm obtains the optimal candidate feature subsets without an exhaustive search of the user electricity consumption feature set.

The invention also provides a user power consumption behavior clustering analysis system, which comprises computer equipment; the computer device is configured or programmed for performing the steps of the inventive method.

The present invention also provides a computer readable storage medium comprising a program running on a processor; the program is configured or programmed for carrying out the steps of the inventive method.

Compared with the prior art, the invention has the beneficial effects that:

1. the method combines a feature selection method based on an improved maximum correlation minimum redundancy criterion with an improved k-means clustering method, so that the obtained electricity utilization feature set is a reduced feature set, the obtained electricity utilization feature is used for replacing original load curve data to perform clustering analysis, the dimension reduction of the electricity utilization data is realized on the premise of ensuring the clustering accuracy, and the calculation efficiency is improved;

2. by adopting the improved k-means clustering method, the clustering number can be accurately given, and the clustering center can be accurately selected, so that the clustering result is more accurate;

3. the feature selection method based on the improved maximum correlation minimum redundancy criterion can distinguish the feature correlation and the redundancy weight, and compare and analyze the quality of the feature selection result.

Drawings

FIG. 1 shows a flow chart of the improved k-means algorithm of the present invention;

FIG. 2 is a flow chart of a feature selection method of the present invention based on the improved maximum correlation minimum redundancy criterion;

FIG. 3 is a graph showing clustering accuracy as a function of feature number in accordance with the present invention;

fig. 4, fig. 5, fig. 6, and fig. 7 show the clustering result and the electricity usage characteristic curve of the class 4 user according to the present invention, respectively.

Detailed Description

Addressing the problems of electricity consumption feature set selection and of the k-means clustering method in feature-selection-based cluster analysis of user electricity consumption behavior, the embodiment of the invention provides a feature selection method based on an improved maximum correlation minimum redundancy criterion together with an improved k-means clustering method, combining the maximum correlation minimum redundancy criterion, a similarity function between data clusters, and the maximum-minimum distance algorithm.

The principle of the embodiment of the invention is as follows:

User-side big data is the data embodiment of users' electricity consumption behavior. The improved k-means clustering method gathers objects whose electricity consumption behavior feature attributes are similar into a number of categories while keeping the categories clearly different from one another, so that users' electricity consumption behavior can be analyzed. To cluster the user electricity consumption behavior features, the number of clusters must be determined first. An interval [k_min, k_max] for the number of clusters is therefore given; for each number of clusters in the interval, the similarity value of all data elements in the corresponding data clusters is calculated with the similarity function between data clusters, and the number of clusters with the minimum similarity value is taken as the optimal value, which makes the number of clusters easier to determine. The reduced feature subset is then used to compute the corresponding features of all electricity users; these features replace the users' 96-point load data, and the users' electricity consumption behavior is analyzed with the improved k-means algorithm.

Cluster analysis of user electricity consumption behavior is based on users' electricity consumption behavior features, so the features must be selected. Feature selection picks a suitable and effective feature subset from the original electricity consumption feature set; the subset should contain the most valuable information in the original set and, at the same time, be compact. That is, the correlation between the selected feature set and the user behavior category should be maximal and the redundancy among the electricity consumption features in the selected subset should be minimal; the maximum correlation minimum redundancy criterion is formulated from this objective. The algorithm first calculates the mutual information between each electricity consumption feature in the original feature set and the target user behavior category, and takes the feature with the maximum mutual information as the first candidate feature subset. The standard maximum correlation minimum redundancy criterion directly takes the difference between the feature-to-class correlation measure and the feature-to-feature redundancy measure, and has the defect that the weights of correlation and redundancy cannot be distinguished.

The invention improves the maximum correlation minimum redundancy criterion by adding a variable that refines the measurement of the correlation between the features and the target category and of the redundancy among the features. Assigning different values to this variable changes the weights of feature correlation and redundancy in the criterion, so that the weights of correlation and redundancy behind different feature selection results can be compared. To obtain candidate feature subsets from user electricity consumption data, data are first randomly extracted from the 96-point load data on the power user side as sample data for analysis, and the samples are cluster-trained with the improved k-means clustering algorithm to obtain the user electricity consumption behavior categories matched to the samples. The original feature set is then computed with the electricity consumption feature formulas in Table 1, such as load rate, daily peak-valley difference, peak load rate and average load rate. Finally, the incremental search algorithm is combined with the improved maximum correlation minimum redundancy criterion to obtain the candidate feature subsets.

The incremental search algorithm first calculates the mutual information between each electricity consumption feature in the original feature set and the target user behavior category, and takes the feature with the maximum mutual information as the first candidate feature subset. It then calculates, for each remaining electricity consumption feature, the value of the mathematical relation defined by the maximum correlation minimum redundancy criterion with respect to the candidate feature set generated in the previous step, selects the feature with the maximum value, and adds it to that candidate feature set to obtain a new candidate feature set. These steps are repeated until the original feature set is empty, and finally the features in the candidate feature set are sorted in descending order of Φ(D, R) to obtain the candidate feature subsets.

The improved k-means algorithm in the embodiment of the invention is as follows:

The similarity of all users within the typical electricity consumption behavior categories is expressed by a similarity function W between data clusters; k_max and k_min denote the maximum and minimum values of the number k of clusters of typical user electricity consumption behavior; Intra(k) expresses the similarity of all users within each typical electricity consumption behavior category; and Inter(k) expresses the similarity between two typical electricity consumption behavior categories. Formula (1) then follows by definition (see lie, lie. Improved K-means algorithm for the identification study of wind power anomaly data. Computer Age, 2020, 2: 6-8):

W＝Intra(k)+(1-Inter(k)/Inter(k)) (1)

Here Inter(k) represents the similarity between two typical user electricity consumption behavior categories; Intra(k) represents the similarity of all users within a typical user electricity consumption behavior category; X represents the user load data set, containing the n user load data records to be clustered; v_i represents an initial cluster center of typical user electricity consumption behavior; δ(v_i) represents the similarity with v_i as the cluster center point; δ(X) represents the similarity of all user load data in X; and V(i) represents the similarity between the cluster center of the i-th typical behavior category and the cluster centers of the other typical behavior categories.

Finally, the value of k at which W reaches its minimum is taken as the optimal number of clusters k_best, so that k_min ≤ k_best ≤ k_max.

The flow of the improved k-means algorithm for determining the optimal number of clusters k is shown in FIG. 1:

1) determine the maximum and minimum values of k, k_max and k_min;

2) taking k_min as the initial value of k, find the initial cluster centers with the k-means algorithm and calculate the value of W;

3) judge whether k is greater than k_max; if it is less, add 1 to k and return to step 2);

4) take min(W(k)) to obtain k_best;

5) determine the k_best cluster centers with the maximum-minimum distance algorithm, and finally assign the sample data of each electricity user to the typical electricity consumption behavior categories by the minimum-distance principle.

The improved maximum correlation minimum redundancy criterion of the embodiment of the invention is as follows:

the maximum correlation and minimum redundancy criterion is based on a mutual information theory, and the correlation relation of the variables is measured by taking a mutual information calculation value between the variables as a standard. It is defined as follows:

Given two random variables x and y with probability density functions p(x) and p(y) and joint probability density p(x, y), when x and y are discrete variables the mutual information between them is defined as

$$I(x; y) = \sum_{x}\sum_{y} p(x, y)\,\log\frac{p(x, y)}{p(x)\,p(y)} \qquad (4)$$

The electricity consumption features and the user category variable are discrete variables, so

$$I(x_i; c) = \sum_{x_i}\sum_{c} p(x_i, c)\,\log\frac{p(x_i, c)}{p(x_i)\,p(c)} \qquad (5)$$

The base of the logarithm differs between fields and there is no unified standard; information theory usually uses base 2, and base 2 is used in the invention.
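The discrete mutual information with base-2 logarithm can be estimated from paired observations as sketched below. This is an illustrative sketch only: the function name is hypothetical, probabilities are estimated by simple frequency counts, and continuous electricity consumption features are assumed to have been discretized (binned) beforehand.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Estimate I(x; y) in bits from two equally long sequences of discrete values."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of each (x, y) pair
    count_x = Counter(xs)          # marginal counts of x
    count_y = Counter(ys)          # marginal counts of y
    mi = 0.0
    for (x, y), n_xy in joint.items():
        # p(x, y) / (p(x) p(y)) simplifies to n_xy * n / (n_x * n_y)
        mi += (n_xy / n) * math.log2(n_xy * n / (count_x[x] * count_y[y]))
    return mi

# A feature that fully determines the class carries 1 bit about a balanced binary class.
dependent = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
# A feature that is independent of the class carries no information.
independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Pairs that never occur are simply absent from the joint counts, which matches the convention that zero-probability terms contribute nothing to the sum.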

The measures of maximum correlation and minimum redundancy among the variables are respectively defined as

$$\max D(S, c),\quad D = \frac{1}{|S|}\sum_{x_i \in S} I(x_i; c) \qquad (6)$$

$$\min R(S),\quad R = \frac{1}{|S|^2}\sum_{x_i, x_j \in S} I(x_i; x_j) \qquad (7)$$

In formulas (6) and (7), S is the user electricity consumption feature set; |S| is the total number of features in S; x_i and x_j are user electricity consumption features in the set; c is the target user category. The value of I(x_i; c) is the mutual information between the user electricity consumption feature x_i and the target user electricity consumption behavior category c, and measures the correlation between them; the value of I(x_i; x_j) is the mutual information between the user electricity consumption features x_i and x_j, and measures the correlation between the two. The value of D measures the correlation between the user electricity consumption feature set and the electricity consumption behavior category; the value of R measures the redundant information contained in the user electricity consumption feature set.

Requiring that the correlation between the selected user electricity consumption feature set and the user category be maximal and that the redundancy among the features in the selected subset be minimal gives the maximum correlation minimum redundancy criterion:

maxΦ(D,R),Φ＝D-R (8)

The technical scheme improves the maximum correlation minimum redundancy criterion by adding a variable α that refines the measurement of the correlation between the features and the target category and of the redundancy among the features. Assigning different values to α changes the weights of feature correlation and redundancy in the criterion, so that the quality of the resulting feature selections can be compared and analyzed. Accordingly, the maximum correlation minimum redundancy criterion of formula (8) is modified to

maxΦ(D,R),Φ＝αD-(1-α)R (9)

Formula (8) is the special case of formula (9) with α = 0.5, up to a constant factor of 1/2 that does not change which feature set maximizes Φ.
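As a small numerical illustration of formulas (8) and (9), assume hypothetical values D = 0.8 and R = 0.3 (chosen for illustration only, not taken from the experiments). The weight factor then shifts the emphasis between correlation and redundancy as follows:

```latex
\begin{aligned}
\alpha = 0.25:&\quad \Phi = 0.25 \times 0.8 - 0.75 \times 0.3 = -0.025 \\
\alpha = 0.5:&\quad \Phi = 0.5 \times 0.8 - 0.5 \times 0.3 = 0.25 = \tfrac{1}{2}(D - R) \\
\alpha = 1:&\quad \Phi = 1 \times 0.8 - 0 \times 0.3 = 0.8
\end{aligned}
```

With α = 1 only correlation counts, with α = 0 only redundancy counts, and α = 0.5 reproduces the ranking of the standard criterion D − R.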

In applying the method of the invention, an incremental search algorithm is used to select the effective features that satisfy the mathematical relation defined by the variable Φ. Let the variable X denote the original user electricity consumption feature set and let S_{m-1} denote the set of selected user electricity consumption features, which contains m−1 selected features. The m-th feature is selected under the maximum correlation minimum redundancy criterion: from the remaining feature set {X − S_{m-1}}, the feature x_j = x_m that maximizes the variable Φ is selected, and x_m satisfies:

$$\max_{x_j \in X - S_{m-1}}\left[\alpha\, I(x_j; c) - (1-\alpha)\,\frac{1}{m-1}\sum_{x_i \in S_{m-1}} I(x_j; x_i)\right] \qquad (10)$$

the improved k-means clustering algorithm of the invention verifies the characteristic clustering effect as follows.

Using the improved k-means clustering algorithm, the invention clusters with the features of each of the N nested candidate feature subsets as the clustering dimensions, and determines the reduced feature subset by comparing clustering accuracy.

① Forming the candidate feature set

Let the variable X be the original user electricity consumption feature set, containing N features; let the variable S denote the set of selected user electricity consumption features, and let the variable Q denote the set of features still to be selected. The detailed steps for forming the candidate feature set are as follows:

1) Initialize the features: let Q = X and let S be the empty set.

2) Let i = 1.

3) According to formulas (4) and (5), calculate the mutual information I(x_i; c), which measures the correlation between the i-th feature x_i and the target user behavior category c; find the feature that satisfies max[I(x_i; c)] and take it as the first selected feature, so that S_1 contains this feature and Q_1 is Q with this feature removed.

4) Let x_i ∈ Q_{m-1} and x_j ∈ S_{m-1} (m = 2, …, N); search Q_{m-1} for the feature that maximizes the value of formula (10), add it to S_{m-1} to obtain S_m, and remove it from Q_{m-1} to obtain Q_m.

5) Add 1 to i and return to step 3) until the set Q becomes empty. The set S so obtained is the candidate feature set. All candidate feature subsets in S are arranged in descending order of the value of formula (10), which gives N candidate feature subsets (as many subsets as there are features). The candidate subsets are nested: each contains all the features of the preceding subset plus one additional feature.
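The incremental search can be sketched as a greedy loop over precomputed mutual information values. This is a hedged sketch, not the patent's exact procedure: the function name, the inputs `relevance` (values of I(x_i; c)) and `redundancy` (values of I(x_i; x_j)), and the toy numbers are assumptions, and each candidate is scored by α times its relevance minus (1 − α) times its mean redundancy with the already selected features, which is the usual incremental form of the criterion Φ = αD − (1 − α)R.

```python
def incremental_mrmr(relevance, redundancy, alpha):
    """Greedy weighted mRMR search; returns the nested subsets S_1, ..., S_N as index lists."""
    n = len(relevance)
    remaining = list(range(n))
    # The first feature is the one with maximum mutual information with the class.
    first = max(remaining, key=lambda i: relevance[i])
    selected = [first]
    remaining.remove(first)
    while remaining:
        def score(i):
            mean_redundancy = sum(redundancy[i][j] for j in selected) / len(selected)
            return alpha * relevance[i] - (1 - alpha) * mean_redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    # Prefixes of the selection order form the nested candidate feature subsets.
    return [selected[:m] for m in range(1, n + 1)]

# Hypothetical values: features 0 and 1 are both relevant but highly redundant.
relevance = [0.9, 0.8, 0.3]
redundancy = [[0.0, 0.7, 0.1],
              [0.7, 0.0, 0.1],
              [0.1, 0.1, 0.0]]
balanced = incremental_mrmr(relevance, redundancy, alpha=0.5)
relevance_only = incremental_mrmr(relevance, redundancy, alpha=1.0)
```

With α = 0.5 the less relevant but non-redundant feature 2 is picked before feature 1, while with α = 1 the order follows relevance alone, which shows how the weight factor changes the nested subsets.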

② forming reduced feature subsets

The clustering accuracy A(S) is an index for evaluating the quality of a feature subset, calculated as

$$A(S) = \frac{\text{number of accurately classified users}}{\text{total number of users}}$$

The specific steps for verifying the feature subsets with the improved k-means clustering algorithm and obtaining the optimal feature subset with the maximum clustering accuracy are as follows:

Let F denote the feature set with the greatest clustering accuracy.

1) The weight factor α takes values in the interval [0, 1]. Starting from 0, it is assigned values with a step of 0.25, i.e. α_i = 0, 0.25, …, 1 for 1 ≤ i ≤ 5. Varying the weight factor modifies the maximum correlation minimum redundancy criterion and places different emphasis on correlation and redundancy.

2) For each value of the weight factor α_i, use the incremental search algorithm based on the improved maximum correlation minimum redundancy criterion to obtain a set of nested candidate feature subsets indexed by j, 1 ≤ j ≤ N, where i indexes the weight factor α_i (α_i = 0, 0.25, …, 1, 1 ≤ i ≤ 5), j is the feature number, and N is the total number of features.

3) For each weight factor α_i, use the improved k-means clustering algorithm to calculate the clustering accuracy of each feature subset in the corresponding nested candidate feature set, adding one feature at a time, and record the maximum clustering accuracy and the corresponding feature subset.

4) Compare the clustering accuracy results of the feature subsets under all weight factors; the feature subset with the maximum clustering accuracy is the reduced feature subset F.
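Steps 1) to 4) amount to a search over weight factors and nested subsets for the highest clustering accuracy, which might be sketched as below. The function name, the `accuracy_of` callback (standing in for clustering on a subset with the improved k-means algorithm and scoring it against the original grouping), and all numbers are hypothetical placeholders rather than results from the patent.

```python
def select_reduced_subset(nested_by_alpha, accuracy_of):
    """Return (alpha, subset, accuracy) with the highest accuracy over all candidates."""
    best = None
    for alpha, subsets in nested_by_alpha.items():
        for subset in subsets:
            acc = accuracy_of(subset)
            # Keep the first subset that reaches the highest accuracy seen so far.
            if best is None or acc > best[2]:
                best = (alpha, list(subset), acc)
    return best

# Hypothetical nested candidate subsets for two weight factors.
nested_by_alpha = {
    0.5: [[0], [0, 2], [0, 2, 1]],
    1.0: [[0], [0, 1], [0, 1, 2]],
}
# Hypothetical accuracies; a real run would cluster the users on each subset.
toy_accuracy = {(0,): 0.60, (0, 1): 0.70, (0, 2): 0.90, (0, 1, 2): 0.85}

def accuracy_of(subset):
    return toy_accuracy[tuple(sorted(subset))]

best_alpha, reduced_subset, best_accuracy = select_reduced_subset(nested_by_alpha, accuracy_of)
```

Passing the accuracy computation in as a callback keeps the search independent of how the clustering itself is implemented.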

The feature selection method based on the improved maximum correlation minimum redundancy criterion (mRMR) of the embodiment of the present invention is as follows.

Because the main idea of the invention is to select a reduced feature set which can be applied to the user electricity category cluster analysis from the user electricity utilization feature set, the variable x appearing hereinafter is referred to as an electricity utilization feature, and the variable c is referred to as a user category. The general flow of power usage feature selection is shown in fig. 2.

As can be seen from fig. 2, the method is mainly divided into 3 steps:

1) An original feature set is constructed.

2) The candidate feature subsets are derived using the improved maximum correlation minimum redundancy criterion.

3) The N nested candidate feature subsets are clustered with the improved k-means algorithm, and the optimal feature subset is determined by comparing the clustering accuracy (the number of accurately classified users divided by the total number of users). The variable N in fig. 2 is the total number of features in the original feature set; here N = 11.
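The clustering accuracy used in step 3) can be illustrated with a short Python sketch. It assumes that a user counts as accurately classified when it falls into the candidate group that receives the most members of its original group, which is how the counting is described in claim 3. The function name and the label encoding are illustrative only.

```python
from collections import Counter


def clustering_accuracy(original_labels, candidate_labels):
    """Accurately classified users divided by total users.

    original_labels  : group of each user from clustering the original load curves
    candidate_labels : group of each user from clustering a candidate feature subset
    """
    if len(original_labels) != len(candidate_labels):
        raise ValueError("both labelings must cover the same users")
    correct = 0
    for group in set(original_labels):
        # Candidate groups that the members of this original group ended up in.
        landed = [c for o, c in zip(original_labels, candidate_labels) if o == group]
        # The largest share counts as the accurately classified users of this group.
        correct += Counter(landed).most_common(1)[0][1]
    return correct / len(original_labels)
```

The measure is insensitive to how the candidate groups are numbered, so the two clusterings need not use matching labels.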

The above embodiment was verified experimentally as follows.

The daily electricity consumption data of 500 power users in a certain area were sampled every 15 minutes, giving 96-point load data. After the daily electricity consumption data are normalized, an original feature set is constructed according to the electricity consumption features given in table 1. Electricity consumption features are then selected to obtain the optimal feature set F, improved k-means clustering is performed on it, and the clustering accuracy is calculated against the clustering of the original load curves with reference to formula (10). The original load curve clustering result is obtained by arranging the 96-point load samples of the 500 users into a 500 × 96 load data matrix and clustering it with the improved k-means clustering algorithm, which yields the cluster groups and the number of users in each group.
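The preprocessing and the center initialization behind this setup can be sketched in Python. Both parts rest on assumptions: the text does not say which normalization is applied, so per-user min-max scaling is used here purely as an example, and `max_min_centers` is a common textbook form of the maximum-minimum distance rule, starting from an arbitrarily chosen first sample, and not necessarily the exact variant of the invention.

```python
import numpy as np


def min_max_normalize(loads):
    """Scale each user's load curve (one row) to [0, 1]; assumed normalization."""
    lo = loads.min(axis=1, keepdims=True)
    hi = loads.max(axis=1, keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero for flat curves
    return (loads - lo) / span


def max_min_centers(data, k, first=0):
    """Pick k initial centers: each new center is the sample farthest
    from its nearest already chosen center."""
    centers = [first]
    nearest = np.linalg.norm(data - data[first], axis=1)
    while len(centers) < k:
        nxt = int(np.argmax(nearest))
        centers.append(nxt)
        nearest = np.minimum(nearest, np.linalg.norm(data - data[nxt], axis=1))
    return centers
```

For the experiment described above, `loads` would be the 500 × 96 load data matrix, and the returned row indices would seed the k-means iterations in place of random initial centers.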

The invention proposes a preliminary set of user electricity consumption features, shown in table 1.

TABLE 1 User electricity consumption feature set

In table 1, P denotes load and Q denotes electricity consumption; the subscripts sum, av, max and min denote total, mean, maximum and minimum, respectively; peak, val and sh denote the peak period, valley period and flat period, respectively; av.peak, av.val and av.sh denote the peak-period mean, valley-period mean and flat-period mean, respectively. The peak-valley period division in the table follows the peak-valley period division of the electricity load in the Shanghai region.
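As an illustration of this notation, the following Python sketch derives a few load features from one 96-point daily curve. The feature names are built from the symbols explained above, but which combinations table 1 actually contains is not reproduced here, and the `PEAK` and `VALLEY` hours are placeholder values, not the Shanghai period division referred to in the text.

```python
import numpy as np


def load_features(curve, peak, valley):
    """curve: 96 load samples at 15-minute steps; peak, valley: boolean masks of length 96."""
    curve = np.asarray(curve, dtype=float)
    peak = np.asarray(peak, dtype=bool)
    valley = np.asarray(valley, dtype=bool)
    flat = ~(peak | valley)  # everything that is neither peak nor valley
    return {
        "P_sum": curve.sum(),
        "P_av": curve.mean(),
        "P_max": curve.max(),
        "P_min": curve.min(),
        "P_av.peak": curve[peak].mean(),
        "P_av.val": curve[valley].mean(),
        "P_av.sh": curve[flat].mean(),
    }


# Placeholder period division (hypothetical hours, for illustration only).
hours = np.arange(96) / 4.0
PEAK = ((hours >= 8) & (hours < 11)) | ((hours >= 18) & (hours < 21))
VALLEY = (hours >= 22) | (hours < 6)
```

Applying `load_features` to every row of the load matrix would give one feature vector per user, which is the kind of original feature set the selection procedure starts from.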

The simulation was run in MATLAB on a computer with a 2.0 GHz CPU (central processing unit) and 4 GB of memory, and all data were normalized before the experiment.

1) Influence of feature number on clustering result

As can be seen from fig. 3, the clustering accuracy tends to increase as the number of features increases, but once the clustering accuracy has reached its maximum, adding further features leaves it substantially unchanged or even reduces it. This indicates that an additional feature does not necessarily increase the clustering accuracy and may even adversely affect the clustering result. The result confirms the necessity of feature selection; it also shows that the overfitting phenomenon of feature selection exists in the electricity consumption feature selection problem as well.

2) Influence of weight factors on clustering accuracy

The clustering accuracy of the candidate feature subset groups under each weight factor is compared, and the feature subset with the highest clustering accuracy is selected as the optimal feature subset, as shown in table 2.

TABLE 2 maximum clustering accuracy feature subsets corresponding to different weighting factors

As can be seen from table 2, the clustering accuracy is greatest when the weight factor α is 0.75, and the corresponding optimal feature subset is {a11, a4, a7}. Comparing the clustering accuracy for α = 0.5 and α = 0.75 shows that, after the weight factor is introduced, the clustering accuracy rises from the original 0.931 to 0.943 and the feature dimension of the optimal feature set falls from the original 5 to 3. The improved method proposed by the invention therefore describes feature correlation and redundancy in finer detail, effectively reduces redundancy among the features, and increases the clustering accuracy while achieving dimension reduction.

The selected optimal feature subset is then used to obtain the corresponding features of all users, the 96-point load data of the users are replaced by the optimal feature subset, and the electricity consumption behavior of the 500 users is analyzed with the improved k-means algorithm. The optimal number of clusters is determined to be 4, as shown in table 3; that is, the users are divided into 4 classes, shown in figs. 4, 5, 6 and 7. The electricity consumption data in figs. 4, 5, 6 and 7 are normalized, and the white line in each figure represents the typical electricity consumption behavior of that class of power users.

TABLE 3 determination of optimal cluster number for improved k-means clustering algorithm

Number of clusters k	Similarity W(k) of all data elements within the data clusters
2	0.09836
3	0.12785
4	0.07453
5	0.24787
6	0.44026
7	0.56782
8	0.49867
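The choice of k = 4 can be checked against the values in table 3 with a few lines of Python. The sketch assumes that the optimal cluster number is the k with the smallest W(k); this agrees with the table, where k = 4 has the lowest value, but the selection rule itself is an assumption here.

```python
# W(k) values copied from table 3.
w_by_k = {
    2: 0.09836,
    3: 0.12785,
    4: 0.07453,
    5: 0.24787,
    6: 0.44026,
    7: 0.56782,
    8: 0.49867,
}

# Assumed rule: the optimal cluster number minimizes W(k).
k_best = min(w_by_k, key=w_by_k.get)
```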

3) Influence of feature extraction on classification of user electricity consumption behaviors

The selected optimal feature subset is used in place of the users' 96-point load data to analyze the electricity consumption behavior of the 500 power users. A cloud-computing-based user electricity consumption analysis method is taken as comparative method 1; it differs from the technical solution of the invention in that it directly uses the daily load rate, valley power coefficient, flat-period electricity consumption percentage and peak-period electricity consumption rate as the features for the analysis. Clustering directly on the 11 features of the original feature set, without feature extraction, is taken as comparative method 2. The analysis results are shown in table 4.

TABLE 4 Performance comparison of the different experimental methods

Method	Clustering accuracy/%	Number of clustering iterations	Clustering time/s
Comparative method 1	82.38	18	0.909
Comparative method 2	88.10	10	0.405
The method of the invention	95.23	8	0.315

As can be seen from table 4, the clustering accuracy of comparative method 1 differs considerably from that of the method of the invention. Because power users are diverse, their load curves are diverse as well, so in terms of analysis accuracy it is necessary to select features and obtain an analysis feature set suited to the users at hand. The feature selection method proposed by the invention selects features according to the characteristics of each data set and obtains a suitable analysis feature set, so it adapts well to the analysis of user data sets.

4) Influence of feature extraction on computational performance

As can be seen from table 4, both the number of clustering iterations and the clustering time of the method of the invention are smaller than those of the two comparative methods. Although the difference in running time is not large because the amount of user data is small, the result shows that the method of the invention effectively selects accurate and compact features suited to electricity consumption data analysis, thereby reducing the number of iterations and the time of the cluster analysis.

Moreover, because the clustering algorithm is based on k-means, its time complexity can be expressed, taking the k-means clustering method as an example, as O(mnkt), where m is the number of data objects being clustered, n is the dimension of the data objects, k is the number of clusters, and t is the number of iterations of the clustering algorithm. Therefore, when the number of power users and the number of cluster centers are the same, replacing the 96-point load data with the optimal features reduces the corresponding t × n value, so the time complexity of the method is lower than that of analyzing the electricity consumption data with the k-means method alone. At the same time, the clustering accuracy of the method remains at 95.23% relative to the direct clustering result of the original load curves, so the method can effectively select reduced features, achieve dimension reduction of the load curve while ensuring clustering accuracy, and reduce the time complexity of the analysis.
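A rough worked example of this argument, in Python, treats the cost as proportional to the product of m, n, k and t as defined above. The iteration counts come from table 4, and n = 3 and n = 11 are the dimensions of the optimal and the original feature set. Using m = 500 and k = 4 for both runs is an assumption made for the comparison, and the product is only a proportional cost estimate, not a measured running time.

```python
def kmeans_cost(m, n, k, t):
    """Proportional k-means cost: objects x dimensions x clusters x iterations."""
    return m * n * k * t


m, k = 500, 4  # assumed equal for both runs

cost_selected = kmeans_cost(m, n=3, k=k, t=8)    # method of the invention
cost_all_11 = kmeans_cost(m, n=11, k=k, t=10)    # comparative method 2
ratio = cost_selected / cost_all_11              # about 0.22
```

Under these assumptions the selected 3-feature set needs roughly a fifth of the distance computations of the 11-feature run. The measured times in table 4 (0.315 s against 0.405 s) differ by less, which is consistent with the remark above that the data set is small.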

Claims

1. A user electricity consumption behavior cluster analysis method is characterized by comprising the following steps:

Calculating clustering accuracy

S3, taking the next candidate feature subset

Returning to the step S2 until the clustering accuracy of all candidate feature subsets is obtained;

2. The user power consumption behavior cluster analysis method according to claim 1, wherein in step S1, the specific obtaining process of the original user cluster grouping information includes:

1) giving the original load data of the users; giving the maximum value k_max and the minimum value k_min of the number k of typical user electricity consumption behavior categories,

N is the total number of the original load data samples of the user;

2) taking k_min as the initial value of k, finding the cluster centers of typical user electricity consumption behavior using the k-means method, and calculating the similarity W of all users within the typical electricity consumption behavior categories of the electricity users;

5) according to the determined optimal number k_best of typical user electricity consumption behavior categories, determining the k_best cluster centers by the maximum-minimum distance algorithm, and finally assigning the original load data of each user to the typical user electricity consumption behavior categories according to the minimum-distance principle, to obtain the original user cluster grouping information.

3. The user electricity consumption behavior cluster analysis method according to claim 1, wherein the determining process of the number of users with correct cluster grouping in step S2 includes:

for the 1st group of users in the original user clustering grouping information, obtaining the number of users of that group that are clustered into each group of the candidate user clustering grouping information; if the number clustered into the P-th group of the candidate user clustering grouping information is the largest, recording that number as P, where P is the number of accurately classified users in the 1st group of users; and repeating these steps until the number of accurately classified users in every group of the original user clustering grouping information is obtained.

4. The user electricity consumption behavior cluster analysis method according to claim 1, wherein the obtaining process of the candidate feature subset comprises:

A1, constructing a user electricity utilization feature set;

A2, using an incremental search algorithm based on the maximum correlation minimum redundancy criterion to obtain a group of nested candidate feature subsets corresponding to the value of the weight factor α_i

Wherein,

5. The user electricity consumption behavior cluster analysis method according to claim 4, wherein the specific implementation process of the step A2 comprises:

2) let i equal to 1;

Order to

Wherein,

Calculating the feature with the largest value, and expressing the feature as

Order to

5) adding 1 to the value of i, returning to the step 3), until the set Q is an empty set, then the obtained alternative feature set S is a candidate feature set, and all candidate feature subsets in the candidate feature set S are classified according to the formula

The sizes of the N candidate feature subsets are arranged in a descending order to obtain N candidate feature subsets

And the relationship between the candidate feature subsets is:

6. A user electricity consumption behavior clustering analysis system, characterized by comprising computer equipment; the computer equipment is configured or programmed to carry out the steps of the method according to any one of claims 1 to 5.

7. A computer-readable storage medium comprising a program running on a processor; the program is configured or programmed to carry out the steps of the method according to any one of claims 1 to 5.