CN110781332A

CN110781332A - Method for clustering daily load curves of residential electricity users based on a compound clustering algorithm

Info

Publication number: CN110781332A
Application number: CN201910983879.9A
Authority: CN
Inventors: 游文霞; 金之榆
Original assignee: China Three Gorges University CTGU
Current assignee: China Three Gorges University CTGU
Priority date: 2019-10-16
Filing date: 2019-10-16
Publication date: 2020-02-11

Abstract

A method for clustering the daily load curves of residential electricity users based on a compound clustering algorithm. Daily load data of residential electricity users is obtained as a data set matrix of P samples, each with Q time-point attributes; the daily load data is preprocessed to obtain an initial cluster; dimensionality reduction is performed on the initial cluster to obtain a dimensionality-reduced cluster; clustering algorithm 1 performs preliminary clustering on the dimensionality-reduced cluster to obtain initial cluster centers; clustering algorithm 2 clusters the initial cluster centers obtained by clustering algorithm 1, and a clustering validity index is used to evaluate the clustering results, finally yielding M cluster centers; the M cluster centers are then used as the initial cluster centers of clustering algorithm 2 to cluster the data and obtain user groups with similar behaviors. The invention clusters large, scattered daily load data into user groups with similar behaviors. By analyzing the clustered user groups, power company managers can better predict consumption peaks and troughs, which gives the management of the power business a more reliable method.

Description

Method for clustering daily load curves of residential electricity users based on a compound clustering algorithm

Technical field

The invention relates to the technical field of electricity consumption by residential electricity users, and in particular to a method for clustering the daily load curves of residential electricity users based on a compound clustering algorithm.

Background art

With the rapid development of the power industry and the spread of smart meters, it has become more convenient to obtain the electricity consumption of residential users; at the same time, power companies obtain larger and more detailed user consumption data.

Faced with this volume of consumption data, existing data mining and analysis techniques can be applied to the daily load data of electricity users to analyze patterns and extract features, making it easier for power companies to provide higher-quality supply services in line with electricity pricing policy. Segmenting residential users by consumption is an important part of providing such service. As the share of residential load keeps growing, analyzing users with a reasonable and efficient data clustering algorithm helps power companies offer more reasonable, personalized supply plans based on user characteristics, giving users a better experience.

However, a single basic clustering algorithm has low clustering efficiency and poor clustering results. For example, the K-means algorithm selects its initial cluster centers at random, so on data sets with a large number of samples the clustering result easily falls into a local optimum. It also cannot determine the optimal number of clusters, which researchers must test one by one, making clustering inefficient. As a result, the underlying patterns and consumption characteristics in user electricity data are not well reflected, and power companies do not get good support for clustering residential users.

Summary of the invention

The present invention provides a method for clustering the daily load curves of residential electricity users based on a compound clustering algorithm. Using data from smart meters that collect electrical load in real time, the method clusters the load curves and thereby groups together users with the same electricity consumption behavior.

The technical scheme adopted by the present invention is:

A method for clustering the daily load curves of residential electricity users based on a compound clustering algorithm, comprising the following steps:

Step 1: Obtain the daily load data of residential electricity users, in the form of a data set matrix of P samples, each with Q time-point attributes;

Step 2: Preprocess the daily load data of residential electricity users to obtain the initial cluster;

Step 3: Perform dimensionality reduction on the initial cluster to obtain the dimensionality-reduced cluster;

Step 4: Use clustering algorithm 1 to perform preliminary clustering on the dimensionality-reduced cluster to obtain the initial cluster centers;

Step 5: Use clustering algorithm 2 to cluster the initial cluster centers obtained by clustering algorithm 1, and use a clustering validity index to evaluate the clustering results, finally obtaining M cluster centers;

Step 6: Use the M cluster centers obtained in step 5 as the initial cluster centers of clustering algorithm 2 to cluster the data, obtain user groups with similar behaviors, and analyze the behavioral characteristics of the resulting user groups.

In step 1, the daily load data set of residential electricity users, with P samples each having Q time-point attributes, is specified as follows:

The P samples are residential user samples. Residential life is mainly affected by factors such as seasonal changes, temperature changes, income level, and the ownership rates of air conditioners and electric cooking appliances; different factors lead to different daily load curves. The Q attributes are the power consumption values collected by the smart meter at each time point of the day, and the value of Q is determined by the interval at which the smart meter collects data.
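To make the relation between the metering interval and Q concrete, here is a minimal sketch. The function name `points_per_day` and the 15-minute interval are illustrative assumptions; the text does not fix a particular interval.

```python
def points_per_day(interval_minutes: int) -> int:
    """Number of readings per day (Q) for a fixed metering interval."""
    minutes_per_day = 24 * 60
    if interval_minutes <= 0 or minutes_per_day % interval_minutes != 0:
        raise ValueError("interval must be positive and divide 1440 minutes evenly")
    return minutes_per_day // interval_minutes


# Hypothetical example: 5 users metered every 15 minutes.
P = 5
Q = points_per_day(15)
# P x Q matrix: one row per user (daily load curve), one column per time point.
load = [[0.0] * Q for _ in range(P)]
```

With a 15-minute interval each daily load curve has 96 points; an hourly interval would give 24.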

In step 2, the preprocessing comprises missing value processing, data standardization, and data regularization;

Missing value processing: data with many missing values is deleted, and data with few missing values is filled in;

Data standardization: the original data is converted to the range [0, 1] by a linear transformation;

Data regularization: the mean of each attribute is subtracted from that attribute, and the result is then divided by the attribute's variance.

In step 3, the dimensionality reduction uses PCA (Principal Component Analysis); the resulting dimensionality-reduced cluster is a data set matrix of p samples, each with q attributes.

In step 4, clustering algorithm 1 performs preliminary clustering on the dimensionality-reduced cluster to obtain user groups with similar behaviors. Specifically, the Mean-shift algorithm clusters the p samples in the data set into N classes, where N is a positive integer.

In step 5, clustering algorithm 2 clusters the cluster centers obtained by clustering algorithm 1, and a clustering validity index evaluates the clustering results. Specifically, the K-means algorithm clusters the N cluster centers obtained by the Mean-shift algorithm, once for each number of clusters in [2, N], where N is a positive integer; the Calinski-Harabasz (CH) index is used to evaluate each clustering result, and the result with the largest CH value is selected, finally yielding M cluster centers, where M is a positive integer in [2, N].

In step 6, the M cluster centers are used as the initial cluster centers of the K-means algorithm, and every sample in the data set, that is, every user or record, is clustered, finally yielding M classes of users.

The invention, a method for clustering the daily load curves of residential electricity users based on a compound clustering algorithm, takes the daily load data of residential electricity users as the object of analysis and applies several algorithmic stages: data preprocessing, dimensionality reduction, and feature clustering, where the feature clustering preferably combines the Mean-shift algorithm with the K-means algorithm. Large, scattered daily load data is clustered into user groups with similar behaviors. By analyzing the clustered user groups, power company managers can better predict consumption peaks and troughs, manage the power business more reliably, and provide better service to electricity customers.

Brief description of the drawings

FIG. 1 is a flow chart of Embodiment 1 of the method of the present invention.

FIG. 2 is a flow chart of Embodiment 2 of the method of the present invention.

Detailed description

Example 1:

Step 4: Use clustering algorithm 1 to perform preliminary clustering on the dimensionality-reduced cluster to obtain N initial cluster centers;

Example 2:

First, the daily load data of residential electricity users is obtained, in the form of a data set matrix of P samples, each with Q time-point attributes.

In general, the data set handled by a grid company's marketing system contains tens of thousands of samples or more, each sample being one residential electricity user. With the spread of smart meters, collecting the daily load data of each residential user has become very easy.

Then the acquired daily load data of residential electricity users is preprocessed to obtain the initial cluster. In this embodiment, preprocessing comprises missing value processing, data standardization, data regularization, and dimensionality reduction of the daily load data; after this processing, the initial cluster is a data set matrix of p samples, each with q attributes.

Missing value processing specifically means deleting samples with few valid values and filling in the missing values of samples with many valid values. When attributes with few valid values are deleted, redundant attributes may be deleted along with them.

If n samples are deleted, p samples remain, where p = P - n. Missing values can be filled in several ways; in this application, the mean of the existing valid values is used as the fill value. Those skilled in the art may choose other filling methods as needed, none of which affects the subsequent analysis.
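The missing value handling described above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: missing readings are encoded as NaN, the threshold `max_missing_ratio` is hypothetical (the text gives no number for "few valid values"), and the fill value is taken as the mean of the same user's valid readings, which is one reading of "the mean of the existing valid values".

```python
import numpy as np


def handle_missing(load, max_missing_ratio=0.5):
    """Drop samples with too many missing readings, then fill the rest.

    load: P x Q array with NaN marking a missing reading.
    Returns a p x Q array, where p = P - n and n is the number of dropped rows.
    """
    load = np.asarray(load, dtype=float)
    # Fraction of missing readings in each sample (row).
    missing_ratio = np.isnan(load).mean(axis=1)
    kept = load[missing_ratio <= max_missing_ratio].copy()
    # Assumption: fill with the mean of the sample's own valid readings.
    row_means = np.nanmean(kept, axis=1, keepdims=True)
    return np.where(np.isnan(kept), row_means, kept)


nan = float("nan")
raw = [
    [1.0, nan, 3.0, 5.0],   # one gap: kept and filled with 3.0
    [nan, nan, nan, 2.0],   # mostly missing: dropped
    [2.0, 2.0, 2.0, 2.0],   # complete: kept unchanged
]
clean = handle_missing(raw)
```

Here P = 3 and n = 1, so p = 2 samples remain.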

Data standardization specifically means converting the original data to the range [0, 1] by a linear transformation. The max-min normalization formula is:

$$X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}$$

This method scales the original data proportionally, where X_norm is the normalized data, X is the original data, and X_max and X_min are the maximum and minimum values of the original data set, respectively.
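A minimal sketch of max-min normalization follows. The `axis` parameter is an assumption added for illustration: `axis=None` takes the maximum and minimum over the whole data set, as the text reads, while `axis=1` would scale each load curve separately.

```python
import numpy as np


def min_max_normalize(x, axis=None):
    """Linearly rescale x to [0, 1] using (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    x_min = x.min(axis=axis, keepdims=True)
    x_max = x.max(axis=axis, keepdims=True)
    # Guard against constant data, where max == min would divide by zero.
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    return (x - x_min) / span


scaled = min_max_normalize([[2.0, 4.0], [6.0, 10.0]])
```

With a minimum of 2 and a maximum of 10, the value 4 maps to (4 - 2) / (10 - 2) = 0.25.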

Data regularization specifically means subtracting from each attribute its mean and then dividing by the attribute's variance. After standardization and regularization, the data of each attribute is centered around 0 with variance 1; that is, the resulting sample data has zero mean and unit variance.
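The following sketch illustrates this step per attribute (column). Note one deliberate deviation, labeled as an assumption: the text says to divide by the variance, but the sketch divides by the standard deviation, because that is the divisor that produces the zero mean and unit variance the text describes.

```python
import numpy as np


def standardize(x):
    """Center each attribute at 0 and scale it to unit variance."""
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=0)
    # Assumption: divide by the standard deviation (not the variance)
    # so that each column ends up with variance exactly 1.
    std = x.std(axis=0)
    std = np.where(std > 0, std, 1.0)  # leave constant columns at 0
    return (x - mean) / std


z = standardize([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
```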

Dimensionality reduction specifically means applying PCA (Principal Component Analysis) to the data set to obtain the dimensionality-reduced cluster R.

If the target dimension is q, the dimensionality-reduced cluster R is a data set matrix of p samples, each with q attributes.
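As an illustration of how a p x Q matrix becomes the p x q matrix R, here is a minimal PCA sketch using the singular value decomposition. The function name `pca_reduce` and the random test data are assumptions for the example; the text does not say how q is chosen.

```python
import numpy as np


def pca_reduce(x, q):
    """Project the rows of x onto the first q principal components."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing
    # singular value, i.e. by decreasing explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:q].T


rng = np.random.default_rng(0)
x = rng.normal(size=(50, 8))   # hypothetical: p = 50 samples, Q = 8 attributes
r = pca_reduce(x, 3)           # R has p = 50 rows and q = 3 columns
```

In practice q is often picked so that the retained components explain a chosen share of the total variance; that criterion is a common convention, not something stated in the text.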

Next, clustering algorithms are applied to the data in data set R to obtain user groups with similar electricity consumption behavior. First, the Mean-shift algorithm clusters the p samples in the data set into N classes, where N is a positive integer. Then the K-means algorithm clusters the N cluster centers obtained by the Mean-shift algorithm, once for each number of clusters in [2, N]; the Calinski-Harabasz (CH) index is used to evaluate each clustering result, and the result with the largest CH value is selected, yielding M cluster centers, where M is a positive integer in [2, N]. Finally, the M cluster centers are used as the initial cluster centers of the K-means algorithm to cluster every sample in the data set, that is, every user or record, yielding M classes of users.

In this embodiment, the Mean-shift algorithm proceeds as follows:

First, an arbitrary sample i is taken from the data set, the mean-shift vector is computed for that sample point, and the current center position is updated; the window is then shifted and the probability density recomputed; once the procedure converges to a local maximum of the probability density, Mean-shift moves on to the next object in data set R.

The more similar or equal the attribute values of the data in a class, the higher the sample density of that class. Each user group with similar behavior is called a class; multiple such classes are finally obtained, each with its own central sample point, and the user groups are named class 1, class 2, ..., class N in turn.
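The Mean-shift procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions: a flat (uniform) kernel, a user-chosen `bandwidth` for the window radius, and a merge rule that treats converged positions closer than one bandwidth as the same class. None of these choices is specified in the text.

```python
import numpy as np


def mean_shift(data, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift. Returns (centers, labels)."""
    data = np.asarray(data, dtype=float)
    modes = data.copy()
    for i in range(len(modes)):
        point = modes[i]
        for _ in range(max_iter):
            # Window: all samples within `bandwidth` of the current position.
            in_window = np.linalg.norm(data - point, axis=1) <= bandwidth
            shifted = data[in_window].mean(axis=0)   # new window center
            moved = np.linalg.norm(shifted - point)  # length of the shift vector
            point = shifted
            if moved < tol:                          # converged to a density peak
                break
        modes[i] = point
    # Merge converged positions that lie within one bandwidth of each other.
    centers = []
    labels = np.empty(len(data), dtype=int)
    for i, mode in enumerate(modes):
        for j, center in enumerate(centers):
            if np.linalg.norm(mode - center) < bandwidth:
                labels[i] = j
                break
        else:
            centers.append(mode)
            labels[i] = len(centers) - 1
    return np.array(centers), labels


blobs = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
         [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1]]
centers, labels = mean_shift(blobs, bandwidth=1.0)
```

The number of classes N is not an input: it falls out of the bandwidth and the data, which is why the method can use Mean-shift to propose candidate centers before K-means.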

In this embodiment, the K-means algorithm proceeds as follows:

First step: denote the N central sample points as X = {x₁, x₂, ..., x_N}, and arbitrarily select k points Y = {y₁, y₂, ..., y_k} from set X as cluster centers, where k is in [2, N];

Second step: compute the distance from each point in set X to the k cluster centers in Y, and assign the point to the class of the nearest cluster center;

Third step: recompute each cluster center;

Fourth step: repeat the second and third steps until the positions of the cluster centers no longer change;

Fifth step: compute the Calinski-Harabasz (CH) index for this value of k;

Sixth step: for each value of k from 2 to N, repeat the first through fifth steps; the value of k with the largest CH index is denoted K, and the corresponding cluster centers are denoted Z = {Z₁, Z₂, ..., Z_K};

Seventh step: compute the distance from each point in data set R to the K cluster centers in Z, and assign the point to the class of the nearest cluster center;

Eighth step: repeat the seventh and third steps until the positions of the cluster centers no longer change;

Ninth step: output the clustering result.
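The nine steps above can be sketched as follows. This is a hedged illustration, not the patented implementation: the function names, the random seed, and the Euclidean distance are assumptions. One further assumption is flagged in the code: the scan stops at k = N - 1, because the CH index divides by (number of points - k) and is therefore undefined when each of the N centers forms its own cluster.

```python
import numpy as np


def nearest(points, centers):
    """Index of the closest center for every point (Euclidean distance)."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(points, init_centers, max_iter=100):
    """Assign/update iterations from given initial centers (steps 2-4, 7-8)."""
    points = np.asarray(points, dtype=float)
    centers = np.array(init_centers, dtype=float)
    for _ in range(max_iter):
        labels = nearest(points, centers)
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = points[labels == j]
            if len(members):                 # an empty class keeps its center
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break                            # centers no longer change
        centers = new_centers
    return centers, nearest(points, centers)


def ch_index(points, labels, centers):
    """Calinski-Harabasz index for one clustering (step 5)."""
    n, k = len(points), len(centers)
    overall = points.mean(axis=0)
    between = sum((labels == j).sum() * np.sum((centers[j] - overall) ** 2)
                  for j in range(k))
    within = sum(np.sum((points[labels == j] - centers[j]) ** 2)
                 for j in range(k))
    return np.inf if within == 0 else (between / (k - 1)) / (within / (n - k))


def compound_kmeans(ms_centers, data, seed=0):
    """Pick K by CH on the Mean-shift centers, then cluster the full data."""
    rng = np.random.default_rng(seed)
    ms_centers = np.asarray(ms_centers, dtype=float)
    data = np.asarray(data, dtype=float)
    best_score, best_centers = -np.inf, None
    # Assumption: k = N is skipped because CH is undefined there.
    for k in range(2, len(ms_centers)):
        picked = rng.choice(len(ms_centers), size=k, replace=False)  # step 1
        centers, labels = kmeans(ms_centers, ms_centers[picked])
        score = ch_index(ms_centers, labels, centers)
        if score > best_score:               # step 6: keep the largest CH
            best_score, best_centers = score, centers
    if best_centers is None:                 # fewer than 3 centers: nothing to select
        best_centers = ms_centers
    return kmeans(data, best_centers)        # steps 7-9 on data set R
```

Because the final pass is seeded with centers that were already selected on the much smaller set X, the expensive scan over k never touches the full data set, which is the efficiency argument the method makes against running plain K-means for every k.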

In this embodiment, the Calinski-Harabasz (CH) index is calculated as follows:

Here g denotes the number of clusters, h denotes the current class, trB(h) denotes the trace of the between-class dispersion matrix, and trW(h) denotes the trace of the within-class dispersion matrix. A larger CH value means that each class is more compact and the classes are more separated from one another, that is, a better clustering result.
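For reference, the Calinski-Harabasz index is commonly written as below. The notation is an assumption of this note and may not map one-to-one onto the g and h above: n is the number of points being clustered, k the number of classes, n_j and c_j the size and center of class C_j, and x̄ the mean of all points.

$$
\mathrm{CH}(k) = \frac{\operatorname{tr} B(k) / (k - 1)}{\operatorname{tr} W(k) / (n - k)},
\qquad
\operatorname{tr} B(k) = \sum_{j=1}^{k} n_j \lVert c_j - \bar{x} \rVert^2,
\qquad
\operatorname{tr} W(k) = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - c_j \rVert^2
$$

As a small worked case, take the four points (0, 0), (0, 2), (10, 0), (10, 2) split into two classes with centers (0, 1) and (10, 1). The overall mean is (5, 1), so tr B = 2·25 + 2·25 = 100 and tr W = 4·1 = 4, giving CH = (100 / 1) / (4 / 2) = 50.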

Claims

1. A method for clustering the daily load curves of residential electricity users based on a compound clustering algorithm, characterized by comprising the following steps:

step 1: acquiring daily load data of residential electricity users, in the form of a data set matrix of P samples, each with Q time-point attributes;

step 2: preprocessing the daily load data of residential electricity users to obtain an initial cluster;

step 3: performing dimensionality reduction on the initial cluster to obtain a dimensionality-reduced cluster;

step 4: performing preliminary clustering on the dimensionality-reduced cluster by using clustering algorithm 1 to obtain initial cluster centers;

step 5: clustering the initial cluster centers obtained by clustering algorithm 1 by using clustering algorithm 2, and evaluating the clustering results by using a clustering validity index, finally obtaining M cluster centers;

step 6: clustering the data by using the M cluster centers obtained in step 5 as the initial cluster centers of clustering algorithm 2, to obtain user groups with similar behaviors.

2. The method for clustering the daily load curves of residential electricity users based on a compound clustering algorithm as claimed in claim 1, wherein: in step 1, the daily load data set of residential electricity users, with P samples each having Q time-point attributes, is specified as follows: the P samples are residential user samples; residential life is mainly affected by factors such as seasonal changes, temperature changes, income level, and the ownership rates of air conditioners and electric cooking appliances, and different factors lead to different daily load curves; the Q attributes are the power consumption values collected by the smart meter at each time point of the day, and the value of Q is determined by the interval at which the smart meter collects data.

3. The method for clustering the daily load curves of residential electricity users based on a compound clustering algorithm as claimed in claim 1, wherein: in step 2, the preprocessing comprises missing value processing, data standardization, and data regularization;

missing value processing, namely deleting data with many missing values and filling in data with few missing values;

data standardization, namely converting the original data to the range [0, 1] by a linear transformation;

data regularization, namely subtracting from each attribute its mean and then dividing the result by the attribute's variance.

4. The method for clustering the daily load curves of residential electricity users based on a compound clustering algorithm as claimed in claim 1, wherein: in step 3, the dimensionality reduction uses PCA (Principal Component Analysis); the resulting dimensionality-reduced cluster is a data set matrix of p samples, each with q attributes.

5. The method for clustering the daily load curves of residential electricity users based on a compound clustering algorithm as claimed in claim 1, wherein: in step 4, clustering algorithm 1 performs preliminary clustering on the dimensionality-reduced cluster to obtain user groups with similar behaviors, specifically comprising clustering the p samples in the data set into N classes by using the Mean-shift algorithm, wherein N is a positive integer.

6. The method for clustering the daily load curves of residential electricity users based on a compound clustering algorithm as claimed in claim 1, wherein: in step 5, clustering algorithm 2 clusters the cluster centers obtained by clustering algorithm 1, and a clustering validity index evaluates the clustering results, specifically comprising: clustering the N cluster centers obtained by the Mean-shift algorithm by using the K-means algorithm, once for each number of clusters in [2, N], wherein N is a positive integer; evaluating each clustering result by using the Calinski-Harabasz (CH) index; and selecting the result with the largest CH value, finally obtaining M cluster centers, wherein M is a positive integer in [2, N].

7. The method for clustering the daily load curves of residential electricity users based on a compound clustering algorithm as claimed in claim 1, wherein: in step 6, the M cluster centers are used as the initial cluster centers of the K-means algorithm, every sample in the data set, namely every user or record, is clustered, and M classes of users are finally obtained.