CN113610182A - User electricity consumption behavior clustering analysis method, system and storage medium - Google Patents

User electricity consumption behavior clustering analysis method, system and storage medium

Info

Publication number
CN113610182A
Authority
CN
China
Prior art keywords
user
clustering
feature
users
consumption behavior
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110952732.0A
Other languages
Chinese (zh)
Inventor
王秀茹
邱冬
韩少华
毛王清
庞吉年
葛萱
刘刚
王云杰
贺国梁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Jiangsu Electric Power Co ltd Suqian Power Supply Branch
Original Assignee
State Grid Jiangsu Electric Power Co ltd Suqian Power Supply Branch
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Jiangsu Electric Power Co ltd Suqian Power Supply Branch filed Critical State Grid Jiangsu Electric Power Co ltd Suqian Power Supply Branch
Priority to CN202110952732.0A priority Critical patent/CN113610182A/en
Publication of CN113610182A publication Critical patent/CN113610182A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201Market modelling; Market analysis; Collecting market data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06Energy or water supply

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Accounting & Taxation (AREA)
  • Economics (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Public Health (AREA)
  • Tourism & Hospitality (AREA)
  • Primary Health Care (AREA)
  • Human Resources & Organizations (AREA)
  • General Health & Medical Sciences (AREA)
  • Water Supply & Treatment (AREA)
  • Probability & Statistics with Applications (AREA)
  • Game Theory and Decision Science (AREA)
  • Supply And Distribution Of Alternating Current (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a user electricity consumption behavior clustering analysis method, system and storage medium. The standard mRMR (maximum correlation minimum redundancy) criterion is improved by introducing a weight factor that refines the measurement of feature correlation and redundancy, and the k-means clustering algorithm is improved by using a max-min distance algorithm to select the cluster centers. On this basis, an electricity consumption feature selection method based on the improved criterion selects independent and effective features to construct a feature set, and the improved k-means algorithm then analyzes user electricity consumption behavior on that feature set, which reduces the dimensionality of the user electricity consumption data. The method has high accuracy and greatly improves computational efficiency.

Description

User electricity consumption behavior clustering analysis method, system and storage medium
Technical Field
The invention relates to techniques for analyzing the electricity consumption behavior of power users, and in particular to a clustering analysis method for user electricity consumption behavior.
Background
Analysis of user electricity consumption behavior underlies many tasks, such as user load management and scheduling, energy-efficiency improvement and retrofitting for power users, and the implementation of demand response strategies on the grid side. User-side big data is the data embodiment of user electricity consumption behavior, so valuable behavior information can be mined from large volumes of user-side data with a suitable data mining method. The k-means clustering algorithm measures data similarity and partitions data effectively and is easy to implement, and it is widely used for mining and analyzing electricity consumption data in the field of smart power utilization.
In recent years, many studies have clustered user electricity consumption behavior using load characteristic indexes as the electricity consumption features. In today's complex electricity consumption environment, electrical equipment of all kinds is increasingly widespread and heavily used, and the behavior of power users is diverse and complex. A fixed feature set therefore cannot serve the analysis targets of all power users and lacks generality. Moreover, because electricity consumption features are closely tied to users' consumption habits, a feature set used for behavior analysis inevitably contains many redundant and irrelevant features. These features increase the complexity and running time of the algorithm, cause the curse of dimensionality, and reduce the accuracy of the model. At present, when load characteristic indexes replace the original load curve data in clustering analysis, only the common indexes (load rate, daily peak-valley difference rate, peak-period load rate, flat-period load rate and valley-period load rate) are used as features, without data analysis or optimized selection of the features. Such methods are therefore limited and lack generality for user electricity consumption behavior analysis.
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the shortcomings of the prior art by providing a user electricity consumption behavior clustering analysis method, system and storage medium that increase the correlation of the user electricity consumption feature set, reduce its redundancy, and improve the accuracy of the clustering results.
To solve this technical problem, the invention adopts the following technical scheme: a user electricity consumption behavior clustering analysis method comprising the following steps:
S1, clustering the original user load data to obtain original user clustering grouping information; clustering the user data calculated from a candidate feature subset to obtain candidate user clustering grouping information;
S2, comparing the candidate user clustering grouping information with the original user clustering grouping information to obtain the number of correctly grouped users, and calculating the clustering accuracy of the candidate feature subset from that number (the accuracy formula is an equation image in the original publication and is not reproduced here);
S3, taking the next candidate feature subset and returning to step S2 until the clustering accuracy of all candidate feature subsets has been obtained;
S4, recording the maximum clustering accuracy over all candidate feature subsets and the candidate feature subset that attains it; that candidate feature subset is the reduced feature subset.
The method quickly and accurately obtains the required reduced feature subset from multiple candidate feature subsets, making the clustering result more accurate. The quality of the clustering result of each candidate feature subset is measured with an index based on the real clustering result, the clustering accuracy, and the candidate feature subset with the maximum clustering accuracy is the reduced feature subset. The number of correctly grouped users used to calculate the clustering accuracy is obtained by comparing the candidate user clustering grouping information with the original user clustering grouping information, which is simple to do and gives a clear result. The reduced feature subset contains no redundant or irrelevant features.
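The selection loop in steps S1 to S4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `select_reduced_subset` and the `accuracy_of` callback are hypothetical names, and the feature names and accuracy values in the usage lines are made up for the example.

```python
def select_reduced_subset(candidate_subsets, accuracy_of):
    """Return (subset, accuracy) for the candidate with the highest accuracy.

    accuracy_of is a caller-supplied function (hypothetical here) that clusters
    the users on the given subset and returns its clustering accuracy.
    """
    best_subset, best_accuracy = None, float("-inf")
    for subset in candidate_subsets:      # S3: visit every candidate subset
        accuracy = accuracy_of(subset)    # S1/S2: cluster on the subset and score it
        if accuracy > best_accuracy:      # S4: remember the maximum
            best_subset, best_accuracy = subset, accuracy
    return best_subset, best_accuracy


# Illustrative, made-up accuracies for three nested candidate subsets.
made_up_accuracy = {
    ("load_rate",): 0.71,
    ("load_rate", "peak_valley_rate"): 0.93,
    ("load_rate", "peak_valley_rate", "valley_load_rate"): 0.88,
}
reduced, accuracy = select_reduced_subset(made_up_accuracy, made_up_accuracy.get)
print(reduced, accuracy)
```

On ties the sketch keeps the first candidate it meets, so passing the nested subsets from smallest to largest favors the more compact subset.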
In step S1, the original user clustering grouping information is obtained as follows:
1) giving the original user load data; giving the maximum value $k_{\max}$ and the minimum value $k_{\min}$ of the number $k$ of typical user electricity consumption behavior categories, where $k_{\max}$ is defined by an equation image in the original publication (not reproduced here), $k_{\min} = 2$, and $n$ is the total number of original user load data samples;
2) taking $k_{\min}$ as the initial value of $k$, finding the cluster centers of typical user electricity consumption behavior with the k-means method, and calculating the similarity $W$ of all users within the typical electricity consumption behavior categories, where $W = \mathrm{Intra}(k) + (1 - \mathrm{Inter}(k)/\mathrm{Inter}(k))$;
3) judging whether the value of $k$ is larger than $k_{\max}$; if it is smaller, setting $k_{\min} = k_{\min} + 1$ and going to step 2); otherwise, going to step 4);
4) taking the value of $k$ at which $W$ is minimal as $k_{best}$, the optimal number of clusters of typical user electricity consumption behavior categories;
5) according to the determined optimal number of clusters, determining the $k_{best}$ cluster centers with the max-min distance algorithm, and finally assigning the original load data of each user to a typical user electricity consumption behavior category by the minimum-distance principle, to obtain the original user clustering grouping information.
The conventional k-means algorithm selects the initial cluster centers of typical user electricity consumption behavior at random and requires the number of typical behavior categories to be preset, which can make the grouping of the original users inaccurate; the invention solves both problems. The maximum and minimum values of the number $k$ of typical user electricity consumption behavior categories are given, a preliminary calculation is made with the inter-cluster similarity function $W$, and the $k$ at which $W$ is minimal is taken as the optimal $k$, so the number of categories no longer has to be given in advance. The initial cluster centers are selected with the max-min distance algorithm rather than at random, which greatly improves the accuracy of the original user clustering grouping result.
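As an illustration of the max-min distance idea, the sketch below picks initial centers by repeatedly adding the point that is farthest from its nearest already-chosen center, and then assigns every point by the minimum-distance principle. It is a minimal sketch, not the patent's exact procedure: the function names are hypothetical, Euclidean distance is assumed, and starting from `points[first_index]` is an assumption because the text does not state how the first center is chosen.

```python
import math


def max_min_centers(points, k, first_index=0):
    """Return the indices of k initial centers chosen by the max-min distance rule."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    chosen = [first_index]  # assumption: the first center is supplied by the caller
    # nearest[i] = distance from point i to the closest center chosen so far
    nearest = [math.dist(p, points[first_index]) for p in points]
    while len(chosen) < k:
        idx = max(range(len(points)), key=lambda i: nearest[i])
        chosen.append(idx)
        nearest = [min(d, math.dist(p, points[idx])) for d, p in zip(nearest, points)]
    return chosen


def assign_to_nearest(points, centers):
    """Label each point with the index of its nearest center (minimum-distance principle)."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]


# Toy two-dimensional data, made up for illustration.
pts = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 8)]
center_idx = max_min_centers(pts, 3)
labels = assign_to_nearest(pts, [pts[i] for i in center_idx])
print(center_idx, labels)
```

Spreading the initial centers in this way avoids the case where random initialization places two centers inside the same group of users.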
In step S2, the number of correctly grouped users is determined as follows: for the 1st group of users in the original user clustering grouping information, count how many of its users fall into each group of the candidate user clustering grouping information; if the $p$-th candidate group receives the most users, record that count as $P$, the number of accurately classified users in the 1st group; repeat for every group in the original user clustering grouping information until the number of accurately classified users has been obtained for all groups. This maximizes the number of users counted as correctly clustered, so the calculation yields the maximum clustering accuracy of the current clustering.
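The counting rule in step S2 can be sketched as follows. The function names are hypothetical, and defining the clustering accuracy as the number of correctly grouped users divided by the total number of users is an assumption, since the accuracy formula itself is not reproduced in this text.

```python
from collections import Counter


def count_correct_users(original_labels, candidate_labels):
    """Sum, over the original groups, the size of the largest overlap with any candidate group."""
    if len(original_labels) != len(candidate_labels):
        raise ValueError("label sequences must have the same length")
    overlap = {}  # original group -> Counter of candidate groups its users fall into
    for orig, cand in zip(original_labels, candidate_labels):
        overlap.setdefault(orig, Counter())[cand] += 1
    return sum(max(counter.values()) for counter in overlap.values())


def clustering_accuracy(original_labels, candidate_labels):
    # Assumption: accuracy = correctly grouped users / total users.
    return count_correct_users(original_labels, candidate_labels) / len(original_labels)


# Toy labels for eight users, made up for illustration.
original = [0, 0, 0, 1, 1, 2, 2, 2]
candidate = [5, 5, 7, 7, 7, 9, 9, 9]
print(count_correct_users(original, candidate))  # 2 + 2 + 3 = 7
print(clustering_accuracy(original, candidate))  # 7 / 8 = 0.875
```

Because each original group is matched to its best candidate group independently, two original groups may map to the same candidate group; this follows the rule as described in the text.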
The candidate feature subsets are obtained as follows:
A1, constructing a user electricity consumption feature set;
A2, using an incremental search algorithm based on the maximum correlation minimum redundancy criterion to obtain, for each value of the weight factor $\alpha_i$, a set of $N$ nested candidate feature subsets indexed by $j$, $1 \le j \le N$, each of which is a subset of the user electricity consumption feature set; here $j$ is the feature number; $N$ is the total number of features; $1 \le i \le M$, where $M$ is the number of weight factors; and each weight factor $\alpha_i$ takes a value in the interval $[0, 1]$.
The candidate feature subsets are obtained simply and quickly; they contain the most valuable information in the user electricity consumption feature set while remaining compact. They are obtained under the maximum correlation minimum redundancy criterion, which ensures that the correlation between a candidate feature subset and the user category is maximal while the redundancy among the features in the subset is minimal, so redundant and irrelevant user electricity consumption features are eliminated from the subset.
Step A2 is implemented as follows:
1) let $Q = X$ and let $S$ be the empty set, where $X$ is the user electricity consumption feature set;
2) let $i = 1$;
3) calculate the mutual information $I(x_i; c)$, which measures the correlation between the $i$-th user electricity consumption feature $x_i$ and the target user electricity consumption behavior category $c$; find the feature that satisfies $\max[I(x_i; c)]$, denote it by the symbol given in the original publication, and set the quantities defined there (the symbol and the defining expressions are equation images in the original publication and are not reproduced here);
4) for $x_i \in Q_{m-1}$ and the $j$-th user electricity consumption feature $x_j \in S_{m-1}$, $m = 2, \ldots, N$, search $Q_{m-1}$ for the feature with the largest value of the selection criterion and update the sets with it (the criterion, the symbol for the selected feature and the update expression are equation images in the original publication and are not reproduced here); put the obtained candidate feature subsets $S_{m-1}$, $S_N$ into the alternative feature set $S$;
5) add 1 to the value of $i$ and return to step 2) until the set $Q$ is empty; the alternative feature set $S$ is then the candidate feature set. All candidate feature subsets in $S$ are arranged in descending order of a criterion value, which yields $N$ candidate feature subsets (the number of feature subsets equals the total number of features). The criterion value, the symbols for the $N$ subsets and the relationship between the subsets are equation images in the original publication and are not reproduced here.
Introducing the weight factor $\alpha_i$ improves the standard maximum correlation minimum redundancy criterion so that the relative weights of correlation and redundancy of the user electricity consumption features can be described precisely, and the incremental search algorithm obtains the optimal candidate feature subsets without an exhaustive search of the user electricity consumption feature set.
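The incremental search with a weight factor can be sketched in Python as below. This is a hedged illustration rather than the patent's exact procedure: the per-feature score `alpha * I(x; c) - (1 - alpha) * mean I(x; s)` is an assumed reading of the weighted criterion modeled on the usual incremental mRMR search, the function and feature names are hypothetical, features are assumed to be already discretized, and the toy data are made up.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information in bits between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((nxy / n) * math.log2(nxy * n / (px[x] * py[y]))
               for (x, y), nxy in pxy.items())


def incremental_mrmr(features, target, alpha=0.5):
    """Return the nested candidate subsets produced by a weighted incremental search.

    features: dict mapping a feature name to its list of discrete values.
    target:   list of behavior category labels, one per user.
    """
    relevance = {name: mutual_information(col, target) for name, col in features.items()}
    remaining = list(features)
    selected = []
    while remaining:
        def score(name):
            if not selected:  # first pick: largest mutual information with the category
                return relevance[name]
            redundancy = sum(mutual_information(features[name], features[s])
                             for s in selected) / len(selected)
            return alpha * relevance[name] - (1 - alpha) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [selected[:j] for j in range(1, len(selected) + 1)]


# Toy data: four categories; "peak_load_copy" duplicates "peak_load".
category = [0, 0, 1, 1, 2, 2, 3, 3]
toy_features = {
    "peak_load": [c // 2 for c in category],
    "peak_load_copy": [c // 2 for c in category],
    "valley_load": [c % 2 for c in category],
}
subsets = incremental_mrmr(toy_features, category, alpha=0.5)
print(subsets)
```

With `alpha=0.5` the duplicate feature is pushed to the end because its redundancy with the first pick cancels its correlation with the category; raising `alpha` toward 1 weakens that penalty.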
The invention also provides a user power consumption behavior clustering analysis system, which comprises computer equipment; the computer device is configured or programmed for performing the steps of the inventive method.
The present invention also provides a computer readable storage medium comprising a program running on a processor; the program is configured or programmed for carrying out the steps of the inventive method.
Compared with the prior art, the invention has the beneficial effects that:
1. the method combines a feature selection method based on an improved maximum correlation minimum redundancy criterion with an improved k-means clustering method, so that the obtained electricity utilization feature set is a reduced feature set, the obtained electricity utilization feature is used for replacing original load curve data to perform clustering analysis, the dimension reduction of the electricity utilization data is realized on the premise of ensuring the clustering accuracy, and the calculation efficiency is improved;
2. by adopting the improved k-means clustering method, the clustering number can be accurately given, and the clustering center can be accurately selected, so that the clustering result is more accurate;
3. the feature selection method based on the improved maximum correlation minimum redundancy criterion can distinguish the feature correlation and the redundancy weight, and compare and analyze the quality of the feature selection result.
Drawings
FIG. 1 shows a flow chart of the improved k-means algorithm of the present invention;
FIG. 2 is a flow chart of a feature selection method of the present invention based on the improved maximum correlation minimum redundancy criterion;
FIG. 3 is a graph showing clustering accuracy as a function of feature number in accordance with the present invention;
fig. 4, fig. 5, fig. 6, and fig. 7 show the clustering result and the electricity usage characteristic curve of the class 4 user according to the present invention, respectively.
Detailed Description
Addressing the selection of the electricity consumption feature set and the k-means clustering method in feature-selection-based clustering analysis of user electricity consumption behavior, the embodiment of the invention combines the maximum correlation minimum redundancy criterion, an inter-cluster similarity function and the max-min distance algorithm to provide a feature selection method based on an improved maximum correlation minimum redundancy criterion and an improved k-means clustering method.
The principle of the embodiment of the invention is as follows:
User-side big data is the data embodiment of user electricity consumption behavior. The improved k-means clustering method gathers objects whose electricity consumption behavior features are highly similar into several categories while keeping the categories well separated, so that user electricity consumption behavior can be analyzed. To cluster the user electricity consumption behavior features, the number of clusters must be determined first. An interval $[k_{\min}, k_{\max}]$ for the number of clusters is therefore given, the similarity value of all data elements within the data clusters is calculated with the inter-cluster similarity function for each number of clusters in the interval, and the number of clusters with the minimum similarity value is taken as the optimal value, which makes the number of clusters easier to determine. The reduced feature subset is then used to compute the corresponding features of all electricity users, these features replace the users' 96-point load data, and the improved k-means algorithm analyzes the users' electricity consumption behavior.
Clustering analysis of user electricity consumption behavior is based on electricity consumption behavior features, so these features must be selected. Feature selection picks a suitable and effective feature subset from the original electricity consumption feature set; the subset should contain the most valuable information in the original set and should also be compact. That is, the correlation between the selected feature set and the user behavior category should be maximal and the redundancy among the features in the selected subset minimal, and the maximum correlation minimum redundancy criterion is derived from this goal. The algorithm first calculates the mutual information between each electricity consumption feature in the original feature set and the target user behavior category and takes the feature with the maximum mutual information as the first candidate feature subset. The standard maximum correlation minimum redundancy criterion directly takes the difference between the measure of correlation between the features and the target category and the measure of redundancy among the features; its drawback is that it cannot distinguish the weights of feature correlation and redundancy.
The invention improves the maximum correlation minimum redundancy criterion by adding a variable that refines the measurement of the correlation between the features and the target category and of the redundancy among the features. Assigning different values to this variable changes the weights of feature correlation and redundancy in the criterion, so the correlation and redundancy weights that correspond to different feature selection results can be compared. To obtain the candidate feature subsets from user electricity consumption data, data are first randomly extracted from the 96-point load data on the power user side as sample data; the samples are clustered with the improved k-means algorithm to obtain the user electricity consumption behavior category of each sample; the original feature set is then calculated with the electricity consumption feature formulas in Table 1, such as load rate, daily peak-valley difference, peak load rate and average load rate; and finally the incremental search algorithm is combined with the improved maximum correlation minimum redundancy criterion to obtain the candidate feature subsets.
The incremental search algorithm first calculates the mutual information between each electricity consumption feature in the original feature set and the target user behavior category and takes the feature with the maximum mutual information as the first candidate feature subset. It then calculates, for each remaining feature, the value of the mathematical relationship defined by the maximum correlation minimum redundancy criterion with respect to the candidate feature set generated in the previous step, selects the feature with the maximum value, and adds it to that candidate feature set to obtain a new candidate feature set. These steps are repeated until the original feature set is empty, and finally the features in the candidate feature set are sorted in descending order of $\Phi(D, R)$ to obtain the candidate feature subsets.
The improved k-means algorithm in the embodiment of the invention is as follows. The similarity of all users within the typical electricity consumption behavior categories is expressed by the inter-cluster similarity function $W$; $k_{\max}$ and $k_{\min}$ denote the maximum and minimum values of the number $k$ of clusters of typical user electricity consumption behavior; Intra(k) denotes the similarity of all users within each typical electricity consumption behavior category; and Inter(k) denotes the similarity between two typical electricity consumption behavior categories. Formula (1) then follows by definition (see lie, lie. Improved K-means algorithm for the identification study of wind power anomaly data. Computer Age, 2020, 2: 6-8):
W=Intra(k)+(1-Inter(k)/Inter(k)) (1)
[Formulas (2) and (3), which define Intra(k) and Inter(k), are equation images in the original publication and are not reproduced here.]
Here Inter(k) represents the similarity between two typical user electricity consumption behavior categories; Intra(k) represents the similarity of all users within a typical electricity consumption behavior category; $X$ represents the set of $n$ user load data records to be clustered; $v_i$ represents the initial cluster center of a typical user electricity consumption behavior; $\delta(v_i)$ represents the similarity with $v_i$ as the cluster center; $\delta(X)$ represents the similarity of all user load data in $X$; and $V(i)$ represents the similarity between the cluster center of the $i$-th typical user electricity consumption behavior category and the cluster centers of the other categories.
Finally, the value of $k$ at which $W$ reaches its minimum is taken as the optimal number of clusters $k_{best}$, so that $k_{\min} \le k_{best} \le k_{\max}$.
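The choice of $k_{best}$ amounts to scanning the interval and keeping the $k$ with the smallest $W$. The sketch below shows only that scan: `best_cluster_number` and `w_of_k` are hypothetical names, and the W values are made-up stand-ins, since a real `w_of_k` would run k-means for the given $k$ and evaluate formula (1) on the resulting clusters.

```python
def best_cluster_number(k_min, k_max, w_of_k):
    """Evaluate W for every k in [k_min, k_max] and return (k_best, scores)."""
    if k_min > k_max:
        raise ValueError("k_min must not exceed k_max")
    scores = {k: w_of_k(k) for k in range(k_min, k_max + 1)}
    k_best = min(scores, key=scores.get)  # k with the smallest W
    return k_best, scores


# Stand-in W values, made up for illustration only.
toy_w = {2: 0.90, 3: 0.55, 4: 0.40, 5: 0.62}
k_best, scores = best_cluster_number(2, 5, toy_w.__getitem__)
print(k_best)
```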
The flow of the improved k-means algorithm for determining the optimal number of clusters $k$ is therefore as shown in FIG. 1:
1) determine the maximum and minimum values of $k$, $k_{\max}$ and $k_{\min}$;
2) taking $k_{\min}$ as the initial value of $k$, find the initial cluster centers with the k-means algorithm and calculate the value of $W$;
3) judge whether the value of $k$ is larger than $k_{\max}$; if it is smaller, set $k_{\min} = k_{\min} + 1$ and go to step 2);
4) take $\min(W(k))$ to obtain $k_{best}$;
5) finally, assign the sample data of each electricity user to a typical electricity consumption behavior category by the minimum-distance principle.
The improved maximum correlation minimum redundancy criterion of the embodiment of the invention is as follows:
The maximum correlation minimum redundancy criterion is based on mutual information theory and uses the mutual information between variables as the measure of their correlation. It is defined as follows.
Given two random variables $x$ and $y$ with probability density functions $p(x)$ and $p(y)$ and joint probability density $p(x, y)$, when $x$ and $y$ are discrete variables the mutual information between them is defined as
$$I(x; y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \quad (4)$$
The electricity consumption features and the user category variable are discrete variables, so
$$I(x_i; c) = \sum_{x_i} \sum_{c} p(x_i, c) \log \frac{p(x_i, c)}{p(x_i)\,p(c)} \quad (5)$$
The base of the logarithm differs between fields and there is no unified standard; information theory often uses base 2, and the invention uses base 2.
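For discrete variables the mutual information can be estimated directly from observed frequencies. The following is a minimal sketch with base-2 logarithms; the function name is hypothetical, and estimating the probabilities by relative frequencies is an assumption of this example.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """I(x; y) in bits, with probabilities estimated as relative frequencies."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("sequences must be non-empty and of equal length")
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) / (p(x) p(y)) = (nxy / n) / ((nx / n) * (ny / n)) = nxy * n / (nx * ny)
    return sum((nxy / n) * math.log2(nxy * n / (px[x] * py[y]))
               for (x, y), nxy in pxy.items())


identical = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])    # fully dependent
independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])  # independent
print(identical, independent)
```

Continuous indexes such as a load rate would first have to be binned into discrete levels before this estimate applies; the binning scheme is not specified in the text.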
The measure of maximum correlation and the measure of minimum redundancy among the variables are defined respectively as
$$\max D(S, c), \quad D = \frac{1}{|S|} \sum_{x_i \in S} I(x_i; c) \quad (6)$$
$$\min R(S), \quad R = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i; x_j) \quad (7)$$
In formulas (6) and (7), $S$ is the user electricity consumption feature set; $|S|$ is the total number of features in $S$; $x_i$ and $x_j$ are user electricity consumption features in the set; and $c$ is the target user category. The value of $I(x_i; c)$ is the mutual information between the user electricity consumption feature $x_i$ and the target user electricity consumption behavior category $c$ and measures the correlation between them; the value of $I(x_i; x_j)$ is the mutual information between the user electricity consumption features $x_i$ and $x_j$ and measures the correlation between them. The value of $D$ measures the correlation between the user electricity consumption feature set and the target user category; the value of $R$ measures the redundant information contained in the user electricity consumption feature set.
The maximum correlation minimum redundancy criterion is obtained based on the selected maximum correlation between the user electricity utilization feature set and the user category and the requirement of selecting the target with the lowest redundancy relation among each feature in the subset as follows:
maxΦ(D,R),Φ=D-R (8)
In this technical scheme the maximum correlation minimum redundancy criterion is improved by adding a variable α, which refines the measurement of the correlation between the features and the target category and of the redundancy among the features. Assigning different values to α changes the weights of feature correlation and redundancy in the criterion, so that the quality of the resulting feature selections can be compared and analyzed. Accordingly, the maximum correlation minimum redundancy criterion of formula (8) is modified to
max Φ(D, R), Φ = αD - (1 - α)R (9)
Equation (8) is a special form of equation (9), where α is 0.5.
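A short check of this statement: substituting α = 0.5 into equation (9) reproduces equation (8) up to a constant positive factor, which does not change which feature set maximizes Φ. The numerical values in the comments are made up purely for illustration and are not taken from the patent.

```latex
\Phi\big|_{\alpha = 0.5} = 0.5\,D - 0.5\,R = \tfrac{1}{2}\,(D - R)
% Illustrative values (assumed): D = 0.6, R = 0.2
% \alpha = 0.5:  \Phi = 0.5(0.6) - 0.5(0.2)  = 0.20
% \alpha = 0.75: \Phi = 0.75(0.6) - 0.25(0.2) = 0.40
```

A larger α puts more weight on correlation with the category, and a smaller α puts more weight on penalizing redundancy.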
In applying the method of the invention, an incremental search algorithm is used to select the effective features that satisfy the mathematical relationship defined by the variable Φ. Let the variable X denote the original user electricity utilization feature set and the variable S_{m-1} the selected user electricity utilization feature set, which contains m-1 selected features. The m-th feature is selected according to the maximum correlation minimum redundancy criterion; that is, from the remaining feature set {X - S_{m-1}}, the feature x_j = x_m that maximizes the variable Φ is selected, where x_m satisfies:
Figure BDA0003219152910000081
The improved k-means clustering algorithm of the invention is used to verify the clustering effect of the features, as follows.
Using the improved k-means clustering algorithm, clustering is performed with the features in each of the N nested candidate feature subsets as the clustering dimensions, and the reduced feature subset is determined from the measured clustering accuracy.
① Forming the candidate feature set
The variable X is the original user electricity utilization feature set and contains N features, the variable S denotes the selected user electricity utilization feature set, and the variable Q denotes the set of user electricity utilization features still to be selected. The detailed steps for forming the candidate feature set are as follows:
1) Initialize the feature sets: let Q = X and let S be the empty set.
2) Let i = 1.
3) According to formulas (4) and (5), calculate for
Figure BDA0003219152910000082
the mutual information I(x_i; c) that measures its correlation with the target user behavior category c; find the feature satisfying max[I(x_i; c)] and denote it as
Figure BDA0003219152910000083
Let
Figure BDA0003219152910000084
Figure BDA0003219152910000085
Figure BDA0003219152910000086
4) Let x_i ∈ Q_{m-1} and x_j ∈ S_{m-1} (m = 2, …, N); from Q_{m-1}, find the feature that maximizes the value of equation (10), and denote it as
Figure BDA0003219152910000091
Let
Figure BDA0003219152910000092
Figure BDA0003219152910000093
Put the obtained candidate feature subsets S_{m-1} and S_N into the alternative feature set S.
5) Add 1 to i and return to step 3), stopping when the set Q becomes empty. The resulting alternative feature set S is the candidate feature set. All candidate feature subsets in S are arranged in descending order of the value of formula (10), which gives N candidate feature subsets (the number of feature subsets equals the total number of features):
Figure BDA0003219152910000094
And the relationship between the candidate subsets is:
Figure BDA0003219152910000095
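The steps above can be sketched as a greedy loop that moves one feature at a time from Q to S and records each intermediate S as a nested candidate subset. This is a sketch under stated assumptions rather than the patent's implementation: equation (10) is not reproduced in this text, so the score below assumes the usual incremental mRMR form with the weight factor α applied as in formula (9); the feature names and data are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Estimate I(x; y) in bits from two equally long sequences of discrete values."""
    n = len(xs)
    cx, cy, cxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2(c * n / (cx[x] * cy[y]))
               for (x, y), c in cxy.items())

def nested_candidate_subsets(features, labels, alpha=0.5):
    """Return the nested subsets S_1, S_2, ..., S_N in selection order.

    features: dict mapping a feature name to its list of discrete values.
    labels:   list of user category labels (the variable c).
    """
    remaining = list(features)   # Q: features still to be selected
    selected = []                # S: features selected so far
    relevance = {f: mutual_information(features[f], labels) for f in remaining}
    subsets = []
    while remaining:
        scores = {}
        for f in remaining:
            if not selected:
                # First pick: largest mutual information with the category.
                scores[f] = relevance[f]
            else:
                # Assumed weighted score: relevance minus mean redundancy.
                redundancy = sum(mutual_information(features[f], features[s])
                                 for s in selected) / len(selected)
                scores[f] = alpha * relevance[f] - (1 - alpha) * redundancy
        best = max(remaining, key=scores.get)
        remaining.remove(best)
        selected.append(best)
        subsets.append(list(selected))
    return subsets

# Hypothetical data: x2 duplicates x1, x3 is weakly informative but not redundant.
labels = [0, 0, 0, 0, 1, 1, 1, 0]
features = {
    "x1": [0, 0, 0, 0, 1, 1, 1, 1],
    "x2": [0, 0, 0, 0, 1, 1, 1, 1],
    "x3": [0, 0, 1, 1, 0, 0, 1, 1],
}
balanced = nested_candidate_subsets(features, labels, alpha=0.5)
relevance_only = nested_candidate_subsets(features, labels, alpha=1.0)
```

In this toy case α = 0.5 defers the duplicated feature x2 to the last position, whereas α = 1 ignores redundancy and selects x2 second.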
② Forming the reduced feature subset
The clustering accuracy A(S) is an index for evaluating the quality of a feature subset and is calculated as follows:
Figure BDA0003219152910000096
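The formula image is not reproduced here. The sketch below assumes that A(S) is the ratio stated later in the description (number of accurately classified users divided by the total number of users) and counts the accurately classified users with the matching rule described in claim 3: for each original group, the largest number of its members that land in a single candidate group. The function name and the toy labels are illustrative.

```python
from collections import Counter

def clustering_accuracy(original_groups, candidate_groups):
    """Fraction of users whose candidate grouping agrees with the original grouping.

    original_groups:  group label per user from clustering the original load data.
    candidate_groups: group label per user from clustering on a candidate subset.
    """
    overlap = {}
    for orig, cand in zip(original_groups, candidate_groups):
        # Count how the members of each original group spread over candidate groups.
        overlap.setdefault(orig, Counter())[cand] += 1
    # For each original group, the largest overlap counts as correctly grouped.
    correct = sum(max(counts.values()) for counts in overlap.values())
    return correct / len(original_groups)

# Toy labels: group 0 keeps 2 of 3 users together, group 1 keeps all 3.
accuracy = clustering_accuracy([0, 0, 0, 1, 1, 1], [2, 2, 5, 5, 5, 5])
```

Note that this rule matches each original group independently, so two original groups may be matched to the same candidate group; a stricter one-to-one matching would be a different design choice.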
The specific steps for verifying the feature subsets with the improved k-means clustering algorithm and obtaining the optimal feature subset with the maximum clustering accuracy are as follows:
Let F denote the feature set with the greatest clustering accuracy.
1) The weight factor α takes values in the interval [0, 1]. Starting from an initial value of 0, it is assigned with a step of 0.25, i.e. α_i = 0, 0.25, …, 1, with 1 ≤ i ≤ 5. Varying the weight factor constitutes the improvement to the maximum correlation minimum redundancy criterion and places different emphasis on correlation and redundancy.
2) Using the incremental search algorithm based on the improved maximum correlation minimum redundancy criterion, obtain for each value of the weight factor α_i the corresponding set of nested candidate feature subsets
Figure BDA0003219152910000097
where 1 ≤ j ≤ N; i is the index of the weight factor α_i (α_i = 0, 0.25, …, 1; 1 ≤ i ≤ 5); j is the feature number; and N is the total number of features.
3) For each weight factor α_i, use the improved k-means clustering algorithm to calculate the clustering accuracy of each feature subset in the corresponding nested candidate feature set, adding one feature at a time,
Figure BDA0003219152910000098
and record the maximum clustering accuracy and the corresponding feature subset.
4) Compare the clustering accuracy results of the feature subsets under all weight factors to obtain the feature subset with the maximum clustering accuracy, which is the reduced feature subset F.
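Steps 1) to 4) amount to a search over every weight factor and every nested subset for the highest clustering accuracy. The sketch below shows only that search; the clustering itself is abstracted into a callable, and the feature names and accuracy values are made-up stand-ins, not results from the patent.

```python
def select_reduced_subset(candidates_by_alpha, accuracy_of):
    """Return (accuracy, alpha, subset) for the most accurate candidate subset.

    candidates_by_alpha: dict mapping a weight factor to its nested subsets.
    accuracy_of:         callable giving the clustering accuracy of one subset.
    """
    best = None
    for alpha, subsets in candidates_by_alpha.items():
        for subset in subsets:
            accuracy = accuracy_of(subset)
            if best is None or accuracy > best[0]:
                best = (accuracy, alpha, subset)
    return best

alphas = [0.25 * i for i in range(5)]  # 0, 0.25, 0.5, 0.75, 1

# Hypothetical nested subsets: the selection order changes with alpha.
toy_candidates = {}
for alpha in alphas:
    order = ("f1", "f2", "f3") if alpha < 0.5 else ("f1", "f3", "f2")
    toy_candidates[alpha] = [order[:j] for j in range(1, 4)]

# Hypothetical accuracies standing in for the k-means verification.
toy_accuracy = {
    frozenset({"f1"}): 0.6,
    frozenset({"f1", "f2"}): 0.7,
    frozenset({"f1", "f3"}): 0.9,
    frozenset({"f1", "f2", "f3"}): 0.8,
}
best = select_reduced_subset(toy_candidates, lambda s: toy_accuracy[frozenset(s)])
```

Because the comparison is strict, ties are resolved in favor of the first weight factor that reaches the maximum; the patent does not state a tie-breaking rule, so this is an assumption.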
The feature selection method based on the improved maximum correlation minimum redundancy criterion (mRMR) of the embodiment of the present invention is as follows.
Because the main idea of the invention is to select a reduced feature set which can be applied to the user electricity category cluster analysis from the user electricity utilization feature set, the variable x appearing hereinafter is referred to as an electricity utilization feature, and the variable c is referred to as a user category. The general flow of power usage feature selection is shown in fig. 2.
As can be seen from fig. 2, the method is mainly divided into 3 steps:
1) Construct the original feature set.
2) Derive the candidate feature subsets using the improved maximum correlation minimum redundancy criterion.
3) Cluster the obtained N nested candidate feature subsets with the improved k-means algorithm, and determine the optimal feature subset by comparing the clustering accuracy (i.e., the number of accurately classified users divided by the total number of users). The variable N in Fig. 2 denotes the total number of features in the original feature set; here N = 11.
The above embodiment was verified experimentally as follows.
The daily electricity consumption data of 500 power users in a certain area are sampled every 15 minutes, giving 96-point load data. After the daily electricity consumption data are normalized, the original feature set is constructed from the electricity utilization features given in Table 1. Feature selection then yields the optimal feature set F, on which improved k-means clustering is performed; the clustering accuracy is calculated against the clustering of the original load curves with reference to formula (10). The original load curve clustering result is obtained by forming the 96-point load samples of the 500 users into a 500 × 96 load data matrix and clustering it with the improved k-means clustering algorithm, which gives the cluster groups and the number of users in each group.
The invention proposes a preliminary user electricity utilization feature set, as shown in Table 1.
Table 1 User electricity utilization feature set
Figure BDA0003219152910000101
In Table 1, P denotes load and Q denotes electricity consumption; the subscripts sum, av, max and min denote total, mean, maximum and minimum, respectively; peak, val and sh denote the peak period, valley period and flat period, respectively; av.peak, av.val and av.sh denote the peak-period mean, valley-period mean and flat-period mean, respectively. The peak-valley period division in the table follows the electricity load peak-valley period division of the Shanghai region.
The simulation was run in the mathematical tool MATLAB on a computer with a 2.0 GHz CPU (central processing unit) and 4 GB of memory; all data were normalized before the experiment.
1) Influence of feature number on clustering result
As can be seen from Fig. 3, the clustering accuracy tends to increase as the number of features increases; once it has reached its maximum, however, adding further features leaves the clustering accuracy essentially unchanged or even reduces it. This shows that an added feature does not necessarily improve the clustering accuracy and may even harm the clustering result. The result confirms the necessity of feature selection, and also shows that feature-selection overfitting occurs in the electricity utilization feature selection problem.
2) Influence of weight factors on clustering accuracy
The clustering accuracy of the candidate feature subsets under each weight factor is compared, and the feature subset with the highest clustering accuracy is selected as the optimal feature subset, as shown in Table 2.
Table 2 Feature subsets with maximum clustering accuracy for different weight factors
Figure BDA0003219152910000111
As can be seen from Table 2, the clustering accuracy is greatest when the weight factor α is 0.75, so the optimal feature subset is A = {a11, a4, a7}. Comparing the clustering accuracy for α = 0.5 and α = 0.75 shows that, after the weight factor is introduced, the clustering accuracy increases from the original 0.931 to 0.943 and the dimension of the optimal feature set falls from the original 5 to 3. The improved method can describe feature correlation and redundancy in finer detail, effectively reduces redundancy among the features, and increases clustering accuracy while achieving dimension reduction.
Then the features of all users corresponding to the selected optimal feature subset are obtained and used in place of the users' 96-point load data, and the electricity consumption behavior of the 500 users is analyzed with the improved k-means algorithm. The optimal number of clusters is determined to be 4, as shown in Table 3; that is, the users are divided into 4 classes, shown in Figs. 4, 5, 6 and 7. The electricity consumption data in Figs. 4 to 7 are normalized, and the white line in each figure represents the typical electricity consumption behavior of that class of power users.
Table 3 Determination of the optimal number of clusters by the improved k-means clustering algorithm
Number of clusters k    Similarity W(k) of all data elements within the data clusters
2                       0.09836
3                       0.12785
4                       0.07453
5                       0.24787
6                       0.44026
7                       0.56782
8                       0.49867
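The choice of 4 clusters can be reproduced from Table 3 by taking the k with the smallest W(k), which is the rule stated in step 4) of claim 2. The variable names below are illustrative; the values are copied from the table.

```python
# W(k) values copied from Table 3, keyed by the number of clusters k.
w_by_k = {
    2: 0.09836,
    3: 0.12785,
    4: 0.07453,
    5: 0.24787,
    6: 0.44026,
    7: 0.56782,
    8: 0.49867,
}

# The optimal number of clusters is the k whose W(k) is smallest.
k_best = min(w_by_k, key=w_by_k.get)
print(k_best)  # prints 4
```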
3) Influence of feature extraction on classification of user electricity consumption behaviors
The selected optimal feature subset is used in place of the users' 96-point load data, and electricity consumption behavior analysis is carried out for the 500 power users. A cloud-computing-based method for analyzing user electricity consumption is used as comparative method 1; it differs from the technical scheme of the invention in that it directly uses the daily load rate, valley power coefficient, flat-period electricity consumption percentage and peak-period electricity consumption rate as features. Directly clustering on the 11 features of the original feature set, without feature extraction, serves as comparative method 2. The analysis results are shown in Table 4.
Table 4 Performance comparison of the different experimental methods
Method                     Clustering accuracy/%    Number of clustering iterations    Clustering time/s
Comparative method 1       82.38                    18                                 0.909
Comparative method 2       88.10                    10                                 0.405
Method of the invention    95.23                    8                                  0.315
As can be seen from Table 4, the clustering accuracy of comparative method 1 differs greatly from that of the method of the invention. Because power users are diverse, their load curves are diverse as well, so in terms of analysis accuracy it is very necessary to select features and obtain an analysis feature set suited to the users concerned. The feature selection method provided by the invention selects features according to the characteristics of each data set to obtain a suitable analysis feature set, and is therefore well suited to the analysis of user data sets.
4) Influence of feature extraction on computational performance
As can be seen from Table 4, both the number of clustering iterations and the clustering time of the method of the invention are smaller than those of the two comparative methods. Although the difference in running time is not large because the amount of user data is small, the results indicate that the method effectively selects accurate and concise features suited to electricity consumption data analysis, thereby reducing the number of iterations and the time of the cluster analysis.
Furthermore, because the clustering algorithm is based on k-means, its time complexity can be expressed, taking the k-means clustering method as an example, as O(mnkt), where m is the number of data objects being clustered, n is the dimension of the data objects, k is the number of clusters, and t is the number of iterations of the clustering algorithm. Therefore, for the same number of power users and the same number of cluster centers, replacing the 96-point load data with the optimal features reduces the corresponding value of t × n, so the time complexity of the method is lower than that of analyzing the electricity consumption data with the k-means method alone. At the same time, the clustering accuracy of the method remains at 95.23% relative to the direct clustering result of the original load curves. The method can thus effectively select reduced features, reduce the dimension of the load curve while preserving clustering accuracy, and lower the time complexity of the analysis.
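As a rough illustration of the O(mnkt) argument, with m and k held fixed the cost scales with n × t. The sketch below plugs in the iteration counts from Table 4 and assumes n = 11 for comparative method 2 and n = 3 for the method of the invention (the dimension of the optimal feature subset reported above); the function name is illustrative, and the figure is an estimate of relative work, not a measured result.

```python
def relative_kmeans_cost(n_dims, iterations):
    """Cost of k-means up to the common factor m * k, following O(m n k t)."""
    return n_dims * iterations

# Comparative method 2: all 11 original features, 10 iterations (Table 4).
cost_all_features = relative_kmeans_cost(11, 10)
# Method of the invention: assumed 3 selected features, 8 iterations (Table 4).
cost_reduced = relative_kmeans_cost(3, 8)

ratio = cost_reduced / cost_all_features  # roughly 0.22 under these assumptions
```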

Claims (7)

1. A user electricity consumption behavior cluster analysis method, characterized by comprising the following steps:
S1, clustering the original load data of the users to obtain original user clustering grouping information; and, for the candidate feature subset
Figure FDA0003219152900000015
clustering the user data calculated from it to obtain candidate user clustering grouping information;
S2, comparing the candidate user clustering grouping information with the original user clustering grouping information to obtain the number of users whose cluster grouping is correct, and, according to the formula
Figure FDA0003219152900000011
Calculating clustering accuracy
Figure FDA0003219152900000012
S3, taking the next candidate feature subset
Figure FDA0003219152900000013
and returning to step S2 until the clustering accuracy of all candidate feature subsets is obtained;
S4, recording the maximum clustering accuracy over all candidate feature subsets and the candidate feature subset corresponding to that maximum, this candidate feature subset being the reduced feature subset.
2. The user electricity consumption behavior cluster analysis method according to claim 1, wherein in step S1, the specific process of obtaining the original user clustering grouping information includes:
1) giving the original load data of the users; giving the maximum value k_max and the minimum value k_min of the number k of typical user electricity consumption behavior categories,
Figure FDA0003219152900000014
where N is the total number of user original load data samples;
2) taking k_min as the initial value of k, finding the typical user electricity consumption behavior cluster centers by the k-means method, and calculating the similarity W of all users within the typical electricity consumption behavior categories;
3) judging whether k is greater than k_max; if not, adding 1 to k and returning to step 2); otherwise, proceeding to step 4);
4) taking the value of k corresponding to the minimum value of W as k_best, k_best being the optimal number of clusters of typical user electricity consumption behavior categories;
5) according to the determined optimal number of clusters of typical user electricity consumption behavior categories, determining the k_best cluster centers by the maximum-minimum distance algorithm, and finally assigning the original load data of each user to the typical user electricity consumption behavior categories according to the minimum-distance principle, to obtain the original user clustering grouping information.
3. The user electricity consumption behavior cluster analysis method according to claim 1, wherein the determining process of the number of users with correct cluster grouping in step S2 includes:
for the 1st group of users in the original user clustering grouping information, obtaining the numbers of its users that are clustered into the different groups of the candidate user clustering grouping information; if the number clustered into the p-th group of the candidate user clustering grouping information is the largest, recording that number as P, P being the number of accurately classified users in the 1st group of users; and repeating these steps until the number of accurately classified users in all groups of the original user clustering grouping information is obtained.
4. The user electricity consumption behavior cluster analysis method according to claim 1, wherein the process of obtaining the candidate feature subsets comprises:
A1, constructing a user electricity utilization feature set;
A2, using an incremental search algorithm based on the maximum correlation minimum redundancy criterion to obtain the set of nested candidate feature subsets corresponding to the value of the weight factor α_i
Figure FDA0003219152900000021
Wherein,
Figure FDA0003219152900000022
are all subsets of the user electricity utilization feature set; j is the feature number; N is the total number of features; 1 ≤ i ≤ M, where M is the number of weight factors; and the weight factor α_i takes values in the interval [0, 1].
5. The user electricity consumption behavior cluster analysis method according to claim 4, wherein the specific implementation process of step A2 comprises:
1) letting Q = X and S be the empty set, where X is the user electricity utilization feature set;
2) letting i = 1;
3) calculating the mutual information I(x_i; c) that measures the correlation between the i-th user electricity utilization feature x_i and the target user electricity consumption behavior category c, finding the feature satisfying max[I(x_i; c)], and denoting it as
Figure FDA0003219152900000023
letting
Figure FDA0003219152900000024
Wherein,
Figure FDA0003219152900000025
4) letting x_i ∈ Q_{m-1} and the j-th user electricity utilization feature x_j ∈ S_{m-1}, m = 2, …, N, finding in Q_{m-1} the feature for which
Figure FDA0003219152900000026
takes the largest value, and denoting it as
Figure FDA0003219152900000027
letting
Figure FDA0003219152900000028
putting the obtained candidate feature subsets S_{m-1} and S_N into the alternative feature set S;
5) adding 1 to i and returning to step 3) until the set Q is empty, whereupon the obtained alternative feature set S is the candidate feature set; arranging all candidate feature subsets in S in descending order of the value of the formula
Figure FDA0003219152900000029
to obtain N candidate feature subsets
Figure FDA00032191529000000210
And the relationship between the candidate feature subsets is:
Figure FDA0003219152900000031
6. A user electricity consumption behavior cluster analysis system, characterized by comprising a computer device; the computer device is configured or programmed to carry out the steps of the method according to any one of claims 1 to 5.
7. A computer-readable storage medium, comprising a program running on a processor; the program is configured or programmed to carry out the steps of the method according to any one of claims 1 to 5.
CN202110952732.0A 2021-08-19 2021-08-19 User electricity consumption behavior clustering analysis method, system and storage medium Pending CN113610182A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110952732.0A CN113610182A (en) 2021-08-19 2021-08-19 User electricity consumption behavior clustering analysis method, system and storage medium


Publications (1)

Publication Number Publication Date
CN113610182A true CN113610182A (en) 2021-11-05

Family

ID=78341217


Country Status (1)

Country Link
CN (1) CN113610182A (en)



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination