CN116894744A - Power grid user data analysis method based on improved k-means clustering algorithm

Info

Publication number: CN116894744A
Application number: CN202310898484.5A
Authority: CN
Inventors: 金家桢; 刘天慈; 成诚; 凌在汛; 马小强; 彭舜尧; 孔令威; 谌思桐; 冷小聪; 刘思杰; 洪度; 邓海伟; 何顺帆; 吴笑民; 邓桂平
Original assignee: Electric Power Research Institute of State Grid Hubei Electric Power Co Ltd; South Central University for Nationalities; Suizhou Power Supply Co of State Grid Hubei Electric Power Co Ltd
Current assignee: Electric Power Research Institute of State Grid Hubei Electric Power Co Ltd; Suizhou Power Supply Co of State Grid Hubei Electric Power Co Ltd; South Central Minzu University
Priority date: 2023-07-21
Filing date: 2023-07-21
Publication date: 2023-10-17

Abstract

The invention relates to the field of power customer service data analysis and discloses a power grid user data analysis method based on an improved k-means clustering algorithm, comprising the following steps: preprocessing customer power data; constructing a k-means clustering algorithm model; improving and tuning the model, namely adopting the k-means++ method so that the initial cluster centers are as far apart as possible, visualizing the performance of the clustering model under different k values with an elbow graph to determine an optimal range for the number of clusters, and calculating a score value to evaluate the clustering effect and select the number of clusters; and performing cluster analysis on the data set with the optimized clustering model, the sample corresponding to each cluster center serving as the representative of that class of users. By classifying customer electricity consumption data with the improved k-means clustering algorithm, the invention enables intelligent analysis of user behavior, classifies and predicts the electricity consumption behavior of power grid users, improves the working efficiency of power customer service, and provides data support for handling power problems.

Description

Power grid user data analysis method based on improved k-means clustering algorithm

Technical Field

The invention relates to the field of power customer service data analysis, and particularly discloses a power grid user data analysis method based on an improved k-means clustering algorithm.

Background

At present, power customer service relies mainly on data analysts manually compiling and summarizing statistics in order to classify and analyze work orders. The classification result is strongly influenced by the subjective judgment of the staff, and key electricity consumption problems are difficult to discover in a large amount of data. To improve the service capacity for power customers, a more efficient data processing scheme is desirable.

Cluster analysis is a common data analysis method in which a clustering algorithm divides the samples of a data set into clusters, each of which may correspond to a class of samples with the same characteristics. The k-means clustering algorithm is an iteratively solved clustering algorithm that is widely used because it is simple and efficient, but it has some defects: the choice of initial cluster centers has a large influence on the clustering result, and the k value, i.e. the number of clusters, is difficult to estimate and specify in advance.

CN114971411A discloses a data-driven method for analyzing the reactive power behavior of power distribution network users, which comprises the following steps: acquiring historical electricity consumption data of different users and calculating the corresponding power factor data; preprocessing the power factor data to obtain power factor daily curves and power factor daily sequence data; performing dimension reduction and clustering on the power factor daily sequence data to obtain a classification of the power factor daily curves; and analyzing reactive power behavior to obtain the reactive power consumption modes and reactive power characteristics of different users. The method analyzes the reactive power behavior of typical users, which benefits electricity management and the economic and safe operation of the power distribution network; its multi-layer convolutional autoencoder effectively extracts deep features with high accuracy, and performing cluster analysis on the low-dimensional deep features reduces clustering time and improves clustering efficiency.

CN113886669A discloses a self-adaptive clustering method for electric power user portraits. It extracts features on the autoencoder principle, using a suitable squared loss function to reduce the dimension of high-dimensional data and obtain low-dimensional information with higher information density; performs cluster analysis with the k-means algorithm to obtain initial cluster classes from the low-dimensional information; and fuses classes using a unimodality statistical test as the basic fusion algorithm. Feature extraction, cluster analysis and class fusion are integrated into one clustering scheme: after the initial cluster classes are obtained, the unimodality test value between classes is calculated and classes are fused according to that value. The method obtains a suitable number of clusters without knowing the number of classes in advance and effectively improves clustering performance. It addresses the problems that the prior art substitutes other parameters for a clustering parameter, performs poorly when building a clustering scheme for large-scale high-dimensional data, and delivers unsatisfactory clustering performance.

However, the above prior art cannot classify and predict the electricity consumption behavior of power grid users, and therefore can neither improve the working efficiency of power customer service nor provide data support for handling power problems.

Disclosure of Invention

In view of the defects in the prior art, the invention provides a power grid user data analysis method based on an improved k-means clustering algorithm, which converts customer electricity consumption data into a standard format and classifies the data with the improved k-means clustering algorithm, thereby realizing intelligent analysis of user behavior.

A power grid user data analysis method based on an improved k-means clustering algorithm comprises the following steps:

S1, preprocessing customer power data;

S2, constructing a k-means clustering algorithm model;

S3, improving and tuning the model: in order to select reasonable initial cluster centers, the k-means++ method is adopted so that the initial cluster centers are as far apart as possible; the performance of the clustering model under different k values is visualized with an elbow graph to determine an optimal range for the number of clusters; and the Calinski-Harabasz score is calculated to evaluate the clustering effect and select the number of clusters;

S4, performing cluster analysis on the data set with the optimized clustering model, the sample corresponding to each cluster center serving as the representative of that class of users.

Further, in the above power grid user data analysis method based on the improved k-means clustering algorithm, step S1 specifically includes: extracting key information from the customer power data and organizing it into an array in which each element is a two-element tuple; normalizing and standardizing the data; and detecting and removing abnormal points, which reduces the influence of outliers on the mean.
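The source does not say which outlier test or scaling rule step S1 uses. The following is a minimal sketch that assumes a z-score test on the consumption value followed by min-max scaling of each column; the function names, the `z_max` threshold and the sample values are illustrative only.

```python
import statistics

def remove_outliers(samples, z_max=3.0):
    """Drop pairs whose second value (consumption) lies more than
    z_max population standard deviations from the mean (assumed rule)."""
    values = [s[1] for s in samples]
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return list(samples)  # all values identical: nothing to remove
    return [s for s in samples if abs(s[1] - mean) / std <= z_max]

def min_max_scale(samples, lo=0.0, hi=1.0):
    """Rescale every column to the interval [lo, hi] (assumed rule)."""
    scaled_cols = []
    for col in zip(*samples):
        cmin, cmax = min(col), max(col)
        span = cmax - cmin
        scaled_cols.append(
            [lo + (hi - lo) * (v - cmin) / span if span else lo for v in col]
        )
    return [list(row) for row in zip(*scaled_cols)]

# Illustrative pairs; the last one has an abnormally large second value.
raw = [[1.0, 0.5], [2.0, 0.6], [3.0, 0.4], [4.0, 0.55], [5.0, 0.45], [6.0, 50.0]]
clean = remove_outliers(raw, z_max=2.0)
scaled = min_max_scale(clean)
```

Removing outliers before scaling keeps a single extreme reading from compressing all other values toward one end of the scaled range.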

Further, in the above power grid user data analysis method based on the improved k-means clustering algorithm, step S2 establishes a k-means clustering algorithm model, and the algorithm is as follows:

setting a value k of a cluster number;

k samples are selected as initial clustering centers;

for each sample in the dataset, calculating its Euclidean distance to k cluster centers and classifying it into the class of cluster centers closest to it;

for each class, recalculating its cluster center;

the preceding two steps (assigning samples and recalculating cluster centers) are repeated until a predetermined number of iterations is reached.
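The steps above can be sketched in plain Python as follows. This is a minimal illustration rather than the patented implementation: the function name `kmeans`, the random choice of initial samples, and the default of 20 iterations are assumptions made for the example.

```python
import math
import random

def kmeans(samples, k, n_iter=20, seed=0):
    """Basic k-means with a fixed number of iterations."""
    rng = random.Random(seed)
    # Select k distinct samples as the initial cluster centers.
    centers = [list(c) for c in rng.sample(samples, k)]

    def assign(points):
        # Put each point in the class of its nearest center (Euclidean distance).
        return [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]

    for _ in range(n_iter):
        labels = assign(samples)
        # Recalculate each cluster center as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(samples, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    # Final assignment so the labels match the returned centers.
    return centers, assign(samples)

# Illustrative data: two well-separated groups of points.
data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, labels = kmeans(data, k=2)
```

A fixed iteration count keeps the run time predictable; a convergence check on the centers could stop earlier, but the text above specifies only the iteration limit.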

Further, in the above power grid user data analysis method based on the improved k-means clustering algorithm, the specific process of step S3 is as follows:

firstly, the initial cluster centers of the data set to be analyzed are selected as follows: a center point $m_1$ is selected at random; the furthest distance $D_i$ from each other data point to the $n$ previously selected centers is calculated; a new center point is selected with a probability determined by $D_i$; and this is repeated until all initial cluster centers are selected;
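The probability expression is not reproduced in the text above, so the sketch below falls back on the commonly published k-means++ seeding rule, which is an assumption here: each point's weight is the squared distance to its nearest already-chosen center, and the next center is drawn with probability proportional to that weight. The function name `kmeanspp_init` is illustrative.

```python
import math
import random

def kmeanspp_init(samples, k, seed=0):
    """Pick k initial centers with the standard k-means++ rule (assumed)."""
    rng = random.Random(seed)
    # The first center is chosen uniformly at random.
    centers = [list(rng.choice(samples))]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [min(math.dist(p, c) for c in centers) ** 2 for p in samples]
        # Draw the next center with probability proportional to d2.
        # Note: random.choices raises ValueError if every weight is zero,
        # i.e. if there are fewer distinct points than requested centers.
        centers.append(list(rng.choices(samples, weights=d2, k=1)[0]))
    return centers

# Illustrative data: four distinct points in two groups.
points = [[0.0, 0.0], [0.0, 0.1], [10.0, 10.0], [10.0, 10.1]]
init = kmeanspp_init(points, k=2)
```

Because a point that is already a center has weight zero, it is effectively never drawn again, and distant points are strongly favored, which spreads the initial centers apart.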

the optimal number of clusters is selected as follows: within a reasonable range of values for the cluster number k, the total within-cluster sum of squared errors (WCSS) of the clustering result is calculated for each cluster number in turn:

$$\mathrm{WCSS} = \sum_{i=1}^{k} \sum_{p \in c_i} \lVert p - m_i \rVert^2$$

where $c_i$ is the set of data points in the i-th cluster, $m_i$ is the cluster center of the i-th cluster, $p$ is a data point in $c_i$, and WCSS is the sum, over all clusters, of the squared distances from $p$ to $m_i$.
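As a small worked illustration of the WCSS definition, the following sketch sums the squared Euclidean distance from every point to the center of its own cluster. The function name `wcss` and the sample points are assumptions made for the example.

```python
import math

def wcss(samples, labels, centers):
    """Sum of squared distances from each point to its own cluster center."""
    return sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(samples, labels))

# Illustrative data: two clusters of two points each on a line.
pts = [[0, 0], [2, 0], [10, 0], [12, 0]]
lab = [0, 0, 1, 1]
ctr = [[1, 0], [11, 0]]
total = wcss(pts, lab, ctr)  # each point is 1 unit from its center
```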

A line graph of the cluster number k against the sum of squared errors WCSS, i.e. an elbow graph, is drawn. The point at which the rate of descent of the line suddenly flattens is the elbow point of the elbow graph, and the optimal cluster number is taken from the range around it; the Calinski-Harabasz score is then calculated to evaluate the clustering effect and select the cluster number.
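The Calinski-Harabasz score is commonly defined as the ratio of between-cluster dispersion to within-cluster dispersion, each divided by its degrees of freedom, with a higher score indicating better-separated clusters. The sketch below assumes that standard definition; the function name and sample points are illustrative.

```python
import math

def calinski_harabasz(samples, labels):
    """Standard Calinski-Harabasz score; needs 2 <= k < n and nonzero WCSS."""
    n = len(samples)
    ids = sorted(set(labels))
    k = len(ids)
    overall = [sum(col) / n for col in zip(*samples)]  # global mean
    between = within = 0.0
    for j in ids:
        members = [p for p, lab in zip(samples, labels) if lab == j]
        center = [sum(col) / len(members) for col in zip(*members)]
        # Between-cluster dispersion, weighted by cluster size.
        between += len(members) * math.dist(center, overall) ** 2
        # Within-cluster dispersion (this cluster's share of WCSS).
        within += sum(math.dist(p, center) ** 2 for p in members)
    return (between / (k - 1)) / (within / (n - k))

pts = [[0, 0], [2, 0], [10, 0], [12, 0]]
good = calinski_harabasz(pts, [0, 0, 1, 1])  # natural grouping
bad = calinski_harabasz(pts, [0, 1, 0, 1])   # mixed grouping
```

Comparing the score across candidate cluster numbers near the elbow gives a numeric criterion to complement the visual reading of the elbow graph.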

Further, after cluster analysis is performed on the data set, the coordinate values of each cluster center are compared against the data preprocessing rule to obtain the specific electricity consumption characteristics of the sample corresponding to that cluster center, and the samples in each cluster are considered to share the characteristics of their cluster center.

By analyzing power grid user data with the improved k-means clustering algorithm, the invention can classify and predict the electricity consumption behavior of power grid users, improves the working efficiency of power customer service, and provides data support for handling power problems.

Drawings

FIG. 1 is a flowchart of a k-means clustering algorithm.

FIG. 2 is a graph of the total within-cluster sum of squared errors of the clustering result for different cluster numbers, i.e., an elbow graph.

FIG. 3 is a graph of Calinski-Harabasz scores for different cluster numbers.

FIG. 4 is a graph of the results of a cluster analysis of grid user data using a modified k-means clustering algorithm.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Referring to fig. 1 to 4, an embodiment of the present invention provides a power grid user data analysis method based on an improved k-means clustering algorithm, including the following steps:

S1, preprocessing the customer power data. Specifically, key information is extracted from the power data of rural power grid customers in a certain area and organized into an array in which each element is a two-element tuple, in this example [date, power consumption]. The data are then normalized and standardized, and abnormal points are detected and removed to reduce the influence of outliers on the mean.
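The text does not state how the date is turned into a number. Purely as an assumption for illustration, the sketch below encodes a date as the month number plus the elapsed fraction of that month, so that a record becomes a numeric [date, power consumption] pair; the function name `encode_record` and the consumption values are hypothetical.

```python
import calendar
from datetime import date

def encode_record(day, consumption):
    """Encode one reading as [month + fraction of month elapsed, consumption].

    The encoding rule is an assumption; the source only states that each
    element is a [date, power consumption] pair.
    """
    days_in_month = calendar.monthrange(day.year, day.month)[1]
    month_value = day.month + (day.day - 1) / days_in_month
    return [round(month_value, 2), consumption]

first_of_may = encode_record(date(2023, 5, 1), 3.7)
mid_june = encode_record(date(2023, 6, 16), 1.2)
```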

S2, establishing a k-means clustering algorithm model, wherein the algorithm is as shown in fig. 1:

setting a value k of a cluster number;

k samples are selected as initial clustering centers;

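for each sample in the dataset, calculating its Euclidean distance to the k cluster centers and classifying it into the class of the cluster center closest to it;

for each class, recalculating its cluster center;

the preceding two steps are repeated until the number of iterations reaches 20.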

S3, improving and optimizing the model, wherein the specific process is as follows:

A graph of the cluster number k against WCSS, i.e., an elbow graph, is drawn as shown in FIG. 2. The point where the rate of descent of the line suddenly flattens, i.e., the "elbow" of the elbow graph, is considered to be the vicinity in which the optimal number of clusters is found. "Number of Clusters" in FIG. 2 refers to the number of clusters.
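The elbow is read visually from the graph in this embodiment. As a rough numerical aid, one common heuristic (an assumption here, not part of the described method) picks the k at which the discrete second difference of WCSS is largest, that is, where the descent flattens most abruptly. The WCSS values below are invented purely to exercise the function and are not taken from FIG. 2.

```python
def elbow_by_second_difference(ks, wcss_values):
    """Return the k with the largest second difference of WCSS.

    Only interior points can be scored, so at least three values are needed.
    """
    best_k, best_score = None, float("-inf")
    for i in range(1, len(ks) - 1):
        # Positive when the drop before k is larger than the drop after it.
        score = wcss_values[i - 1] - 2 * wcss_values[i] + wcss_values[i + 1]
        if score > best_score:
            best_k, best_score = ks[i], score
    return best_k

# Made-up WCSS values for k = 2..8 (illustration only).
ks = [2, 3, 4, 5, 6, 7, 8]
wcss_values = [100, 70, 45, 20, 17, 15, 14]
elbow_k = elbow_by_second_difference(ks, wcss_values)
```

Such a heuristic only narrows the range; the final cluster number is still chosen with the score described next.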

The Calinski-Harabasz score is then calculated for the clustering obtained with each cluster number in this range to evaluate the clustering effect. As shown in FIG. 3, the clustering effect is best when the cluster number is 6, so 6 is selected as the optimal cluster number.

S4, performing cluster analysis on the data set with the optimized clustering model, the sample corresponding to each cluster center serving as the representative of that class of users. The specific process is as follows: the cluster number is set to 6, the initial cluster centers are selected, and cluster analysis is performed on the data set; the result is shown in FIG. 4. The coordinate values of each cluster center are compared against the data preprocessing rule to obtain the specific electricity consumption characteristics of the sample corresponding to that cluster center, and the samples in each cluster are considered to share the characteristics of their cluster center.

In this example, 6 cluster centers are obtained:

[5.25, 3.70], [7.69, 2.00], [4.70, 0.54], [9.25, 0.34], [6.27, 0.63], [7.73, 0.20].

Comparison with the raw data yields the following: cluster center 1 corresponds to the irrigation and drainage electricity consumption of an agricultural irrigation and drainage user, and every sample in its cluster is agricultural irrigation and drainage consumption, characterized by high consumption in May and June, when agricultural production requires pump stations to work at high intensity; cluster center 2 corresponds to the summer residential electricity consumption of a certain user, and the consumption pattern of every sample in its cluster is close to that of urban residents, with high-power appliances such as air conditioners consuming a large amount of electricity; the other cluster centers all correspond to the household electricity consumption of ordinary rural residents.
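To illustrate how these cluster centers could serve as class representatives, the following sketch assigns a new sample to its nearest center and returns the interpretation given above. It assumes the new sample has been preprocessed into the same [date, power consumption] coordinates as the centers; the function name `classify` and the two query samples are hypothetical.

```python
import math

# The six cluster centers reported in this example.
CENTERS = [[5.25, 3.70], [7.69, 2.00], [4.70, 0.54],
           [9.25, 0.34], [6.27, 0.63], [7.73, 0.20]]

# Interpretations taken from the comparison with the raw data.
DESCRIPTIONS = [
    "agricultural irrigation and drainage",
    "summer residential, close to urban usage",
    "ordinary rural residential",
    "ordinary rural residential",
    "ordinary rural residential",
    "ordinary rural residential",
]

def classify(sample):
    """Return (1-based cluster center number, description) of the nearest center."""
    idx = min(range(len(CENTERS)), key=lambda j: math.dist(sample, CENTERS[j]))
    return idx + 1, DESCRIPTIONS[idx]

# Hypothetical preprocessed samples.
high_usage_in_may = classify([5.4, 3.5])
low_usage_in_july = classify([7.7, 0.3])
```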

The invention analyzes the power grid user data based on the improved k-means clustering algorithm, realizes classification and prejudgment of the power grid user electricity consumption behavior, improves the working efficiency of power customer service, and provides data support for the treatment of power problems.

The foregoing is merely illustrative embodiments of the present invention, and the present invention is not limited thereto, and any changes or substitutions that may be easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims

1. A power grid user data analysis method based on an improved k-means clustering algorithm, characterized by comprising the following steps:

S1, preprocessing customer power data;

S2, constructing a k-means clustering algorithm model;

2. The power grid user data analysis method based on the improved k-means clustering algorithm as claimed in claim 1, characterized in that step S1 specifically includes: extracting key information from the customer power data and organizing it into an array in which each element is a two-element tuple; normalizing and standardizing the data; and detecting and removing abnormal points, which reduces the influence of outliers on the mean.

3. The power grid user data analysis method based on the improved k-means clustering algorithm as claimed in claim 1, characterized in that the model algorithm of step S2 is as follows:

setting a value k of a cluster number;

k samples are selected as initial clustering centers;

for each sample in the dataset, calculating its Euclidean distance to the k cluster centers and classifying it into the class of the cluster center closest to it;

for each class, recalculating its cluster center;

4. The power grid user data analysis method based on the improved k-means clustering algorithm as claimed in claim 1, characterized in that the specific process of step S3 is as follows:

drawing a line graph of the cluster number k against the sum of squared errors WCSS, i.e. an elbow graph, wherein the point at which the rate of descent of the line suddenly flattens is the elbow point of the elbow graph; taking the optimal cluster number from the range around that point; and further calculating the Calinski-Harabasz score to evaluate the clustering effect and select the cluster number.

5. The power grid user data analysis method based on the improved k-means clustering algorithm as claimed in claim 1, characterized in that the specific process of step S4 is as follows: after cluster analysis is performed on the data set, the coordinate values of each cluster center are compared against the data preprocessing rule to obtain the specific electricity consumption characteristics of the sample corresponding to that cluster center, and the samples in each cluster are considered to share the characteristics of their cluster center.