CN109871860B - Daily load curve dimension reduction clustering method based on kernel principal component analysis - Google Patents


Info

Publication number
CN109871860B
CN109871860B · Application CN201811302328.3A
Authority
CN
China
Prior art keywords
clustering
matrix
daily load
kernel
principal component
Prior art date
Legal status
Active
Application number
CN201811302328.3A
Other languages
Chinese (zh)
Other versions
CN109871860A (en)
Inventor
李欣然
何聪
钟伟
宋军英
汤杰
徐飘
Current Assignee
Hunan University
Original Assignee
Hunan University
Priority date
Filing date
Publication date
Application filed by Hunan University
Priority to CN201811302328.3A (filed 2018-11-02)
Publication of application: CN109871860A (2019-06-11)
Application granted
Publication of granted patent: CN109871860B (2022-12-13)
Legal status: Active

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04: INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S: SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S 10/00: Systems supporting electrical power generation, transmission or distribution
    • Y04S 10/50: Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Abstract

The invention discloses a daily load curve dimensionality reduction clustering method based on kernel principal component analysis. First, a kernel function maps the preprocessed daily load data matrix nonlinearly into a high-dimensional space to obtain a high-dimensional kernel matrix. The kernel matrix is then corrected and eigendecomposed to obtain the corresponding eigenvalues and unitized eigenvectors; an extraction efficiency is set according to the descending trend of the eigenvalues, which yields the principal components and the number of dimension reduction indexes. Next, the eigenvalues are taken as weights and normalized so that they sum to 1, giving a weight vector, and the projection of the corrected Gaussian kernel matrix on the principal components is taken as the dimension-reduced data matrix. Finally, the dimension-reduced data matrix and the weight vector are used as the input of a weighted clustering algorithm to cluster the daily load curves, and the optimal clustering number and clustering result are determined based on the Silhouette index. The method improves clustering quality while improving computational efficiency; the clustering results accord with reality and the method has engineering value.

Description

Daily load curve dimension reduction clustering method based on kernel principal component analysis
Technical Field
The invention belongs to the technical field of analysis and control of power systems, and mainly relates to a daily load curve dimension reduction clustering method based on kernel principal component analysis.
Background
Daily load curve clustering is the basis of big data mining for power distribution and utilization, and provides guidance for load forecasting, power grid planning and demand-side response. With the continuous advance of the smart grid, the informatization of the power system keeps improving: electricity consumption information acquisition systems, distribution network GIS systems, distribution automation systems and the like are gradually being completed, and distribution and utilization data exhibit big-data characteristics such as large volume, many types and rapid growth. Adopting effective data mining techniques to finely segment massive users of different types under this big-data background, so as to uncover the internal relations among different load types and the corresponding electricity consumption behaviors and characteristics, is of great significance to grid companies and power users.
In the traditional daily load curve clustering method, the power values at the sampling time points of each daily load curve are normalized by their maximum value, and the curves are then clustered with algorithms such as K-means and fuzzy C-means using the Euclidean distance as the similarity criterion. This approach has two disadvantages: 1) for time-series load curves, the similarity between curves is influenced by many factors such as temperature and climate, income and electricity price policy, and cannot be fully reflected by distance alone; moreover, in high dimensions the distances between curves with distinct load shapes, such as daily load curves, tend to become nearly equal, so that load curves are easily mis-assigned; 2) as the volume of load data keeps growing, the method faces a huge challenge in computational efficiency.
The indirect clustering method can effectively overcome both shortcomings of the traditional method: the daily load curves are first reduced in dimension, and the extracted dimension reduction indexes are used as the input of the clustering algorithm. Two important problems remain: 1) determining the dimension reduction indexes and their number; 2) determining the weights of the dimension reduction indexes. Selecting an appropriate number of dimension reduction indexes and configuring the weight of each index reasonably can greatly improve the accuracy and efficiency of daily load curve clustering.
Disclosure of Invention
The technical problem solved by the invention is to provide a daily load curve dimension reduction clustering method based on kernel principal component analysis, aimed at the problems of existing daily load curve clustering methods.
The technical scheme adopted by the invention is as follows:
a daily load curve dimension reduction clustering method based on kernel principal component analysis comprises the following steps:
step 1), preprocessing daily load power curve data: firstly, identifying and correcting data in a daily load power curve, identifying the data with the load power change rate exceeding a preset threshold value as abnormal data, correcting the curve with abnormal and missing data through a unitary three-point parabolic interpolation algorithm, then performing per-unit processing on the corrected daily load power curve data by taking the maximum power value of the daily load power curve as a reference value, and obtaining a normalized daily load power curve active power per-unit value matrix as an original data matrix;
step 2), based on the original data matrix, first selecting a Gaussian radial basis kernel function and using it to map the original data matrix nonlinearly into a high-dimensional feature space to obtain a kernel matrix, and then correcting the kernel matrix;
step 3), performing principal component analysis on the corrected kernel matrix and determining the principal components and the number of dimension reduction indexes: first performing eigenvalue decomposition to obtain the eigenvalues and eigenvectors, then obtaining the unitized eigenvectors by the Gram-Schmidt orthogonalization method and calculating the cumulative contribution rate corresponding to the eigenvalues; different fitted curves are then formed by fitting successively larger numbers of leading eigenvalues, the number of eigenvalues corresponding to the fitted curve enclosing the smallest area with the x and y axes is selected as the number of characteristic indexes and taken as the number of dimension reduction indexes, and the corresponding unitized eigenvectors are taken as the principal components;
step 4), a projection matrix obtained by projection of the corrected kernel matrix on the principal component is used as a dimensionality reduction data matrix obtained after dimensionality reduction through kernel principal component analysis, and then the eigenvalue corresponding to each principal component is used as a dimensionality reduction index weight, and meanwhile, a corresponding dimensionality reduction index weight vector is obtained;
and step 5), combining the dimension-reduced data matrix and the dimension reduction index weight vector and clustering with a weighted k-means algorithm: firstly, randomly selecting a plurality of samples in the dimension-reduced data matrix as initial clustering centers; then, combining the dimension reduction indexes and the index weight vector, dividing each sample into the cluster of the initial clustering center with the smallest weighted Euclidean distance, and updating the clustering centers according to the samples in each resulting cluster; then calculating the squared error from each sample to its clustering center, returning to the above steps for a new iteration if the squared error is smaller than that obtained after the previous iteration, and otherwise finishing the clustering; after clustering is finished, calculating the Silhouette indexes of the clustering results under different clustering numbers, and taking the clustering number with the maximum Silhouette index value and its clustering result as the best.
The daily load curve dimension reduction clustering method based on kernel principal component analysis comprises the following steps of 1):
step 1-1), identifying and correcting abnormal data in the daily load power curve:
firstly, abnormal data are identified: let $P_k=[p_{k,1},p_{k,2},\cdots,p_{k,m}]^T$ be the power values of a certain load curve at its sampling time points, and identify abnormal data according to formula (1):
[formula (1): load power change rate $\delta_{k,i}$ of the load curve at the i-th point; equation image not reproduced]
in the formula, $\delta_{k,i}$ is the load power change rate of the load curve at the i-th point; data whose change rate exceeds the preset threshold $\varepsilon$ are regarded as abnormal;
then, correcting, and if the data loss and abnormal quantity of a certain load curve are not less than 10%, directly deleting the load curve;
if the data missing amount and the abnormal amount of a certain load curve are lower than 10%, interpolation fitting is carried out through a unitary three-point parabolic interpolation algorithm to obtain the values of the abnormal amount and the missing amount again, and the unitary three-point parabolic interpolation algorithm comprises the following steps:
let the n nodes $x_i$ ($i=0,1,\cdots,n-1$) have function values $y_i=f(x_i)$, with $x_0<x_1<\cdots<x_{n-1}$ and corresponding function values $y_0<y_1<\cdots<y_{n-1}$; to calculate the approximate function value $z=f(t)$ at a specified interpolation point $t$, the 3 nodes closest to $t$ are selected, $x_{k-1}$, $x_k$, $x_{k+1}$ ($x_k<t<x_{k+1}$), and the value of $z$ is then calculated according to the parabolic interpolation formula (2), i.e.
$$z=\sum_{i=m}^{m+2}\Big(\prod_{j=m,\,j\neq i}^{m+2}\frac{t-x_j}{x_i-x_j}\Big)\,y_i \qquad (2)$$
in the formula, $m=k-1$ when $|x_k-t|<|t-x_{k+1}|$, and $m=k$ when $|x_k-t|>|t-x_{k+1}|$;
if the interpolation point t is not in the interval containing n nodes, only 2 nodes at one end of the interval are selected for linear interpolation;
step 1-2), per-unit processing is carried out on the corrected daily load power curve data:
let $P_k=[p_{k1},\ldots,p_{ki},\ldots,p_{km}]\in R^{1\times m}$ be the m-point original active power matrix of the corrected k-th daily load power curve, where $k=1,2,\ldots,N$, N is the total number of daily load power curves, $p_{ki}$ is the original active power at the i-th point of the k-th daily load power curve, $i=1,2,\ldots,m$, and m is the number of sampling points; then $P=[P_1,\ldots,P_k,\ldots,P_N]^T\in R^{N\times m}$ is the m-point original active power matrix of the N daily load power curves, where $R^{1\times m}$ denotes a real matrix with 1 row and m columns and $R^{N\times m}$ a real matrix with N rows and m columns;
taking the maximum power value of the daily load power curve, $p_{k\cdot\max}=\max\{p_{k1},p_{k2},\ldots,p_{ki},\ldots,p_{km}\}$, as the reference value, the original data samples are converted to per-unit values according to formula (3),
$$p'_{ki}=p_{ki}/p_{k\cdot\max} \qquad (3)$$
obtaining the normalized active power per-unit value matrix of each daily load power curve:
$P'_k=[p'_{k1},p'_{k2},\ldots,p'_{ki},\ldots,p'_{km}]\in R^{1\times m}$; the resulting matrix is denoted $A\in R^{N\times m}$, where N is the number of daily load curves.
The daily load curve dimension reduction clustering method based on kernel principal component analysis comprises the following steps of 2):
step 2-1), carrying out nonlinear mapping on the original data matrix A to a high-dimensional space to obtain a kernel matrix K:
let vectors $x_i\in R^{1\times m}$ and $y_j\in R^{1\times m}$ ($i\neq j$) be the per-unit data of any two daily load curves in the original data matrix A; the matrix A is mapped nonlinearly by the Gaussian radial basis kernel function according to formula (4) to obtain the kernel matrix K;
$$K(x_i,y_j)=\exp\!\Big(-\frac{\lVert x_i-y_j\rVert^2}{2\sigma^2}\Big) \qquad (4)$$
where $\sigma$ is the width parameter of the Gaussian kernel;
step 2-2), correcting the kernel matrix K:
the kernel matrix K is corrected according to the following formula (5):
$$K'=K-l_N K-K\,l_N+l_N K\,l_N \qquad (5)$$
in formula (5), $l_N$ is the $N\times N$ matrix $\frac{1}{N}(l_{i,j})$ with $l_{i,j}=1$ for all $i,j$, i.e. every element of $l_N$ equals $1/N$; m is the daily load curve dimension, i.e. the number of sampling points, and N is the number of daily load curves.
The daily load curve dimension reduction clustering method based on kernel principal component analysis comprises the following steps of 3):
the method for performing principal component analysis on the corrected kernel matrix K comprises the following steps:
a) Performing eigenvalue decomposition on the corrected kernel matrix K to obtain the eigenvalues $\lambda_1,\lambda_2,\cdots,\lambda_N$ and the corresponding eigenvectors $v_1,v_2,\cdots,v_N$;
b) Sorting the eigenvalues in descending order and reordering the corresponding eigenvectors accordingly;
c) Unitizing the eigenvectors by the Gram-Schmidt orthogonalization method to obtain $\alpha_1,\alpha_2,\cdots,\alpha_N$.
The method for determining the principal component components and the number of the dimensionality reduction indexes comprises the following steps:
calculating the cumulative contribution rates $B_1,B_2,\cdots,B_N$ corresponding to the eigenvalues $\lambda_1,\lambda_2,\cdots,\lambda_N$ and setting the extraction efficiency $\eta$; if the cumulative contribution rate corresponding to the t-th eigenvalue is not smaller than the set extraction efficiency, i.e. $B_t\geq\eta$, then t is taken as the number of dimension reduction indexes, and the unitized eigenvectors $\alpha_1,\alpha_2,\cdots,\alpha_t$ corresponding to $\lambda_1,\lambda_2,\cdots,\lambda_t$ are the principal components; the extraction efficiency $\eta$ is set as follows:
a) Set an initial value of $\eta$ such that t = 3, fit a curve y = ax + b to the first three eigenvalues $\lambda_1,\lambda_2,\lambda_3$, and calculate the area S1 of the figure enclosed by the fitted curve and the x and y axes;
b) Change the value of $\eta$ so that t = 4, 5, 6, ..., and calculate the areas S2, S3, S4, ... of the figures enclosed by the corresponding fitted curves and the x and y axes;
c) Compare the enclosed areas for the different values of t; the $\eta$ corresponding to the minimum area is the extraction efficiency, and the t satisfying $B_t\geq\eta$ at that point is the finally determined number of characteristic indexes.
The daily load curve dimension reduction clustering method based on kernel principal component analysis comprises the following steps of 4):
the method for obtaining the dimension reduction data matrix B comprises the following steps:
calculating the projection $Y=K\cdot\alpha$ of the corrected kernel matrix K on the principal components, where $\alpha=(\alpha_1,\alpha_2,\cdots,\alpha_t)$; the obtained projection matrix Y is the dimension-reduced data matrix B obtained after the data are reduced in dimension by kernel principal component analysis;
the method for determining the weight vector of the dimensionality reduction index comprises the following steps:
let the eigenvalues corresponding to the principal components be $\lambda_1,\lambda_2,\cdots,\lambda_t$; the weight of each eigenvalue is determined according to formula (6),
$$w_i=\frac{\lambda_i}{\sum_{j=1}^{t}\lambda_j} \qquad (6)$$
the weight coefficient $w_i$ of each index calculated from formula (6) determines the dimension reduction index weight vector $W=[w_1,w_2,\cdots,w_t]$.
The daily load curve dimension reduction clustering method based on kernel principal component analysis comprises the following steps of 5):
the clustering step by the weighted k-means algorithm is as follows:
a) Initialization: the initialization comprises initialization of an initial clustering number k, an iteration number n and an initial clustering center;
b) Sample classification: k samples are selected at random from the dimension-reduced data matrix B as the initial clustering centers $c_1,c_2,\cdots,c_k$, $c_j=[c_{j,1},c_{j,2},\cdots,c_{j,t}]$;
let $y_i=[y_{i,1},y_{i,2},\cdots,y_{i,t}]$ be the i-th sample in the dimension-reduced data matrix B; its weighted Euclidean distance to the j-th clustering center $c_j$ is calculated according to formula (7), and every sample i is assigned to the clustering center $c_j$ with the smallest weighted Euclidean distance,
$$d_w(y_i,c_j)=\sqrt{\sum_{l=1}^{t}w_l\,(y_{i,l}-c_{j,l})^2} \qquad (7)$$
c) Updating the clustering centers: let $n_j$ be the number of samples of class j obtained in step b) and $y_{i,j}$ be the i-th sample in the j-th class; the clustering center of every class is updated according to formula (8);
$$c_j=\frac{1}{n_j}\sum_{i=1}^{n_j}y_{i,j} \qquad (8)$$
d) Iterative calculation: let the current iteration number be t; the squared error J(t) from all samples in matrix B to their clustering centers is calculated according to formula (9) and compared with the previous error J(t-1); if J(t) - J(t-1) < 0, jump to step b), otherwise the algorithm ends and the corresponding clustering result is output;
$$J(t)=\sum_{j=1}^{k}\sum_{y_i\in C_j}d_w(y_i,c_j)^2 \qquad (9)$$
where $C_j$ denotes the set of samples currently assigned to the j-th cluster;
the method for determining the optimal clustering number and the clustering result based on the Silhouette index comprises the following steps:
e) Setting the maximum cluster number $k_{\max}=\lfloor\sqrt{N}\rfloor$ and the minimum cluster number $k_{\min}=2$, where N is the number of samples;
f) For each clustering number k with $k_{\min}\leq k\leq k_{\max}$, obtaining the clustering result by the weighted k-means algorithm and then calculating the Silhouette index of each clustering result;
g) Comparing the Silhouette indexes under the different clustering numbers; the k value corresponding to the maximum Silhouette index is the optimal clustering number k, and the corresponding result is the optimal clustering result.
The daily load curve dimension reduction clustering method based on kernel principal component analysis comprises the following steps of:
let the N daily load curves be divided into k classes; for the i-th sample in the j-th class, define $d_a(i)$ as the average distance between sample i and the other samples of its class, where a smaller value indicates a more compact class; define $d_b(i)$ as the minimum average distance from sample i to all samples of each of the other classes, where a larger value indicates better separation between classes; the Silhouette index $I_{Sil}(i)$ of sample i is defined by formula (10), and the Silhouette index $I_{Silhouette}$ of all samples by formula (11):
$$I_{Sil}(i)=\frac{d_b(i)-d_a(i)}{\max\{d_a(i),d_b(i)\}} \qquad (10)$$
$$I_{Silhouette}=\frac{1}{N}\sum_{i=1}^{N}I_{Sil}(i) \qquad (11)$$
in formula (11), the larger $I_{Silhouette}$ is, the better the clustering quality; a negative value indicates clustering failure; the k corresponding to the maximum $I_{Silhouette}$ is the optimal clustering number, and the clustering result corresponding to that k is the optimal clustering result.
The technical effect of the invention is that the efficiency and quality of daily load curve clustering can be improved to a great extent under the background of big data. On the other hand, the method can effectively solve the problems that the number and the weight of dimension reduction indexes are difficult to determine in the dimension reduction processing process in the indirect clustering method. The clustering result is in accordance with the actual engineering, and powerful support can be provided for a power grid company to analyze the power utilization behavior of users and formulate a reasonable power utilization plan. Has good application prospect.
Drawings
FIG. 1 is a block diagram of the general concept of the method of the present invention.
FIG. 2 is a diagram showing the result of descending order of eigenvalues.
FIG. 3 is a flow chart of the weighted k-means algorithm.
Fig. 4 is a flowchart of determining the optimal cluster number and cluster result based on the Silhouette index.
Detailed Description
The daily load curve dimension reduction clustering method based on kernel principal component analysis provided by the invention is explained in detail by combining the accompanying drawings as follows:
the general idea block diagram of the invention is shown in fig. 1, and comprises the following steps:
1) Preprocessing the daily load power curve data to obtain the original data matrix $A\in R^{N\times m}$, where N is the number of daily load curves and m is the dimension, i.e. the number of sampling points;
2) Using the original data matrix A obtained in step 1), a Gaussian Radial Basis Function (RBF) kernel is selected and the matrix A is mapped nonlinearly into a high-dimensional feature space to obtain the kernel matrix K, which is then corrected;
3) Performing principal component analysis on the corrected kernel matrix K, and determining principal component components and the number of dimension reduction indexes;
4) Combining the kernel matrix K obtained in the step 2) and the principal component components determined in the step 3) to obtain a dimension reduction data matrix B, taking the characteristic values corresponding to the principal component components as dimension reduction index weights, and determining a dimension reduction index weight vector W;
5) Combining the dimension-reduced data matrix B and the dimension reduction index weight vector W obtained in step 4), clustering is performed with the weighted k-means algorithm, and the optimal clustering number and clustering result are determined based on the Silhouette index.
The step 1) comprises the following steps:
1-1) identifying and correcting abnormal data in a daily load power curve;
1-2) per unit processing is carried out on the corrected daily load power curve data;
the relevant explanation for the above steps is as follows:
the method for identifying the abnormal data in the step 1-1) specifically comprises the following steps:
Let $P_k=[p_{k,1},p_{k,2},\cdots,p_{k,m}]^T$ be the power values of a certain load curve at its sampling time points; abnormal data are identified by formula (1).
[formula (1): load power change rate $\delta_{k,i}$ of the load curve at the i-th point; equation image not reproduced]
In the formula, $\delta_{k,i}$ is the load power change rate of the load curve at the i-th point; data whose change rate exceeds the preset threshold $\varepsilon$ are regarded as abnormal. Without loss of generality, $\varepsilon$ may be taken as 0.5 to 0.8.
The method for correcting the abnormal data in the step 1-1) comprises the following specific steps:
if the data loss amount and the abnormal amount of a certain load curve reach 10% or more, the curve is determined to be invalid and the load curve is directly deleted.
If the data missing amount and the abnormal amount of a certain load curve are lower than 10%, setting the values of the missing amount and the abnormal amount to be 0, and then carrying out interpolation fitting on the missing amount and the abnormal amount which are set to be 0 by using a unitary three-point parabolic interpolation algorithm to obtain a normal value again. The principle of the unitary three-point parabolic interpolation algorithm is as follows:
Let the n nodes $x_i$ ($i=0,1,\cdots,n-1$) have function values $y_i=f(x_i)$, with $x_0<x_1<\cdots<x_{n-1}$ and corresponding function values $y_0<y_1<\cdots<y_{n-1}$. To calculate the approximate function value $z=f(t)$ at a specified interpolation point $t$, the 3 nodes closest to $t$ are selected, $x_{k-1}$, $x_k$, $x_{k+1}$ ($x_k<t<x_{k+1}$), and the value of $z$ is then calculated according to the parabolic interpolation formula (2), i.e.
$$z=\sum_{i=m}^{m+2}\Big(\prod_{j=m,\,j\neq i}^{m+2}\frac{t-x_j}{x_i-x_j}\Big)\,y_i \qquad (2)$$
In the formula, $m=k-1$ when $|x_k-t|<|t-x_{k+1}|$, and $m=k$ when $|x_k-t|>|t-x_{k+1}|$.
If the interpolation point t is not in the interval containing n nodes, only 2 nodes at one end of the interval are selected for linear interpolation.
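For illustration, a Python sketch of the unitary three-point parabolic interpolation described above is given below; the NumPy dependency, the function name and the handling of the node window index are assumptions of this sketch, not part of the patent's specification.

```python
import numpy as np

def three_point_parabolic_interp(x, y, t):
    """Approximate f(t) by a Lagrange quadratic through the three nodes
    nearest to t; outside the node interval, fall back to linear
    interpolation on the two end nodes, as described above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    if t <= x[0] or t >= x[-1]:
        # interpolation point outside the node interval: linear interpolation
        i = 0 if t <= x[0] else n - 2
        return y[i] + (y[i + 1] - y[i]) * (t - x[i]) / (x[i + 1] - x[i])
    k = int(np.searchsorted(x, t))                 # x[k-1] < t <= x[k]
    # shift the 3-node window toward whichever neighbouring node is closer to t
    m = k - 2 if abs(t - x[k - 1]) < abs(x[k] - t) else k - 1
    m = min(max(m, 0), n - 3)
    z = 0.0
    for i in range(m, m + 3):                      # Lagrange quadratic over x[m..m+2]
        num = den = 1.0
        for j in range(m, m + 3):
            if j != i:
                num *= t - x[j]
                den *= x[i] - x[j]
        z += y[i] * num / den
    return z
```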
The method for performing per unit processing on the corrected daily load power curve data in the step 1-2) specifically includes:
Let $P_k=[p_{k1},\ldots,p_{ki},\ldots,p_{km}]\in R^{1\times m}$ be the m-point original active power matrix of the corrected k-th daily load power curve, where $k=1,2,\ldots,N$, N is the total number of daily load power curves, $p_{ki}$ is the original active power at the i-th point of the k-th daily load power curve, $i=1,2,\ldots,m$, and m is the number of sampling points, i.e. the dimension, generally 48; then $P=[P_1,\ldots,P_k,\ldots,P_N]^T\in R^{N\times m}$ is the m-point original active power matrix of the N daily load power curves, where $R^{1\times m}$ denotes a real matrix with 1 row and m columns and $R^{N\times m}$ a real matrix with N rows and m columns;
taking the maximum power value of the daily load power curve, $p_{k\cdot\max}=\max\{p_{k1},p_{k2},\ldots,p_{ki},\ldots,p_{km}\}$, as the reference value, the original data samples are converted to per-unit values according to formula (3),
$$p'_{ki}=p_{ki}/p_{k\cdot\max} \qquad (3)$$
obtaining the normalized active power per-unit value matrix of each daily load power curve, $P'_k=[p'_{k1},p'_{k2},\ldots,p'_{ki},\ldots,p'_{km}]\in R^{1\times m}$; the resulting matrix is denoted $A\in R^{N\times m}$.
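A minimal Python sketch of the whole preprocessing of step 1) follows (anomaly flagging by change rate, repair of bad points, and per-unit normalization by the daily maximum). The default threshold value, the function name and the use of np.interp as a stand-in for the parabolic interpolation sketched above are assumptions of this sketch.

```python
import numpy as np

def preprocess_daily_loads(P_raw, eps=0.5, max_bad_ratio=0.10):
    """P_raw: (N, m) array of raw active power, NaN marking missing points.
    Returns the per-unit original data matrix A, dropping curves whose
    missing/abnormal points reach max_bad_ratio of the samples."""
    cleaned = []
    for row in P_raw:
        bad = np.isnan(row)
        # flag points whose relative change rate versus the previous point exceeds eps
        rate = np.abs(np.diff(row)) / np.maximum(np.abs(row[:-1]), 1e-9)
        bad[1:] |= rate > eps
        if bad.mean() >= max_bad_ratio:
            continue                                  # too many bad points: discard the curve
        idx = np.arange(len(row))
        repaired = row.copy()
        # np.interp stands in here for the parabolic interpolation sketched earlier
        repaired[bad] = np.interp(idx[bad], idx[~bad], row[~bad])
        cleaned.append(repaired / repaired.max())     # per-unit by the daily maximum, formula (3)
    return np.vstack(cleaned)                         # original data matrix A
```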
2) Using the original data matrix A obtained in step 1), a Gaussian Radial Basis Function (RBF) kernel is selected and the matrix A is mapped nonlinearly into a high-dimensional feature space to obtain the kernel matrix K, which is then corrected; step 2) comprises the following steps:
2-1) carrying out nonlinear mapping on the original data matrix A to a high-dimensional space to obtain a kernel matrix K;
2-2) correcting the kernel matrix K;
the above steps are explained in relation to the following:
the method for performing nonlinear mapping on the original data matrix A to the high-dimensional space in the step 2-1) to obtain the kernel matrix K specifically comprises the following steps:
Let vectors $x_i\in R^{1\times m}$ and $y_j\in R^{1\times m}$ ($i\neq j$) be the per-unit data of any two daily load curves in the original data matrix A; the matrix A is mapped nonlinearly by the Gaussian Radial Basis (RBF) kernel function according to formula (4) to obtain the kernel matrix K.
$$K(x_i,y_j)=\exp\!\Big(-\frac{\lVert x_i-y_j\rVert^2}{2\sigma^2}\Big) \qquad (4)$$
where $\sigma$ is the width parameter of the Gaussian kernel.
The method for correcting the kernel matrix K in the step 2-2) comprises the following steps:
the kernel matrix K is corrected in accordance with the following formula (5).
$$K'=K-l_N K-K\,l_N+l_N K\,l_N \qquad (5)$$
In formula (5), $l_N$ is the $N\times N$ matrix $\frac{1}{N}(l_{i,j})$ with $l_{i,j}=1$ for all $i,j$, i.e. every element of $l_N$ equals $1/N$; m is the daily load curve dimension and N is the number of daily load curves.
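The kernel construction and correction of step 2) can be sketched in Python as follows; the kernel width sigma is left as a free parameter because its value is not fixed here, and the reading of correction formula (5) as the standard KPCA centering is an assumption of this sketch.

```python
import numpy as np

def rbf_kernel_matrix(A, sigma=1.0):
    """Gaussian radial basis kernel matrix of the per-unit data matrix A (N, m)."""
    sq = np.sum(A ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * A @ A.T    # pairwise squared Euclidean distances
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def center_kernel_matrix(K):
    """Correction of the kernel matrix: K' = K - LK - KL + LKL,
    with L the N x N matrix whose every entry is 1/N."""
    N = K.shape[0]
    L = np.full((N, N), 1.0 / N)
    return K - L @ K - K @ L + L @ K @ L
```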
3) Performing principal component analysis on the corrected kernel matrix K, and determining principal component components and the number of dimension reduction indexes;
the method for performing principal component analysis on the corrected kernel matrix K in the step 3) comprises the following steps:
a) Performing eigenvalue decomposition on the corrected kernel matrix K to obtain the eigenvalues $\lambda_1,\lambda_2,\cdots,\lambda_N$ and the corresponding eigenvectors $v_1,v_2,\cdots,v_N$;
b) Sorting the eigenvalues in descending order (FIG. 2 shows the result of sorting the eigenvalues in descending order) and reordering the corresponding eigenvectors accordingly;
c) Unitizing the eigenvectors by the Gram-Schmidt orthogonalization method to obtain $\alpha_1,\alpha_2,\cdots,\alpha_N$.
The method for determining the number of the principal component components and the dimensionality reduction indexes in the step 3) comprises the following steps:
calculating the cumulative contribution rates $B_1,B_2,\cdots,B_N$ corresponding to the eigenvalues $\lambda_1,\lambda_2,\cdots,\lambda_N$ and setting the extraction efficiency $\eta$; if $B_t\geq\eta$, the number of dimension reduction indexes is set to t, and the unitized eigenvectors $\alpha_1,\alpha_2,\cdots,\alpha_t$ corresponding to $\lambda_1,\lambda_2,\cdots,\lambda_t$ are the principal components. The extraction efficiency $\eta$ is set as follows:
a) Set an initial value of $\eta$ such that t = 3, fit a curve y = ax + b to the first three eigenvalues $\lambda_1,\lambda_2,\lambda_3$, and calculate the area S1 of the figure enclosed by the fitted curve and the x and y axes;
b) Change the value of $\eta$ so that t = 4, 5, 6, ..., and calculate the areas S2, S3, S4, ... of the figures enclosed by the corresponding fitted curves and the x and y axes;
c) Compare the enclosed areas for the different values of t; the t corresponding to the minimum area is the finally determined number of characteristic indexes.
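A Python sketch of step 3) is given below: eigendecomposition of the corrected kernel matrix, cumulative contribution rates, and selection of the number of retained components t by the fitted-line area rule described above. The exact fitting details (a straight line fitted by least squares, triangle area with the axes) and the fallback value are assumptions of this sketch.

```python
import numpy as np

def select_principal_components(K_centered):
    """Return the leading eigenvalues, their eigenvectors and the cumulative
    contribution rate reached by the chosen number of components t."""
    eigvals, eigvecs = np.linalg.eigh(K_centered)
    order = np.argsort(eigvals)[::-1]                  # descending eigenvalue order
    eigvals, eigvecs = np.clip(eigvals[order], 0.0, None), eigvecs[:, order]
    contrib = np.cumsum(eigvals) / eigvals.sum()       # cumulative contribution rates B_t
    best_t, best_area = 3, np.inf
    for t in range(3, len(eigvals) + 1):
        x = np.arange(1, t + 1, dtype=float)
        a, b = np.polyfit(x, eigvals[:t], 1)           # fit y = a*x + b to the first t eigenvalues
        if a >= 0 or b <= 0:
            continue                                   # line must slope down and cross both axes
        area = -b * b / (2.0 * a)                      # triangle enclosed with the x and y axes
        if area < best_area:
            best_t, best_area = t, area
    # np.linalg.eigh already returns orthonormal (unitized) eigenvectors
    return eigvals[:best_t], eigvecs[:, :best_t], contrib[best_t - 1]
```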
4) Combining the kernel matrix K obtained in the step 2) and the principal component components determined in the step 3) to obtain a dimension reduction data matrix B, taking the characteristic values corresponding to the principal component components as dimension reduction index weights, and determining a dimension reduction index weight vector W;
the method for obtaining the dimensionality reduction data matrix B in the step 4) comprises the following steps:
Calculating the projection $Y=K\cdot\alpha$ of the corrected kernel matrix K on the extracted principal components, where $\alpha=(\alpha_1,\alpha_2,\cdots,\alpha_t)$. The obtained projection matrix Y is the dimension-reduced data matrix B obtained after the data are reduced in dimension by kernel principal component analysis.
The method for determining the dimensionality reduction index weight vector in the step 4) comprises the following steps:
Let the eigenvalues corresponding to the principal components be $\lambda_1,\lambda_2,\cdots,\lambda_t$; the weight of each eigenvalue is determined according to formula (6).
$$w_i=\frac{\lambda_i}{\sum_{j=1}^{t}\lambda_j} \qquad (6)$$
The weight coefficient $w_i$ of each index can be obtained from formula (6), and thus the dimension reduction index weight vector $W=[w_1,w_2,\cdots,w_t]$ is determined.
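Step 4) then amounts to a projection and a normalization, sketched below in Python (the function name is an assumption of this sketch):

```python
import numpy as np

def reduce_and_weight(K_centered, eigvals_t, alphas_t):
    """Project the corrected kernel matrix on the retained principal components
    (Y = K · alpha) and build the index weight vector of formula (6)."""
    B = K_centered @ alphas_t          # dimension-reduced data matrix B, shape (N, t)
    W = eigvals_t / eigvals_t.sum()    # eigenvalue weights normalized to sum to 1
    return B, W
```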
5) Combining the dimensionality reduction data matrix B obtained in the step 4) and the dimensionality reduction index weight vector W, clustering by using a weighted k-means algorithm, and determining the optimal clustering number and a clustering result based on a Silhouette index;
as shown in fig. 3, the step of performing cluster analysis by using the weighted k-means algorithm in step 5) includes:
a) Initialization. The initialization comprises the initial clustering number k, the iteration number n and the initial clustering centers; the iteration number n is generally taken as 100.
b) Sample classification. k samples are selected at random from the dimension-reduced data matrix B as the initial clustering centers $c_1,c_2,\cdots,c_k$, $c_j=[c_{j,1},c_{j,2},\cdots,c_{j,t}]$.
Let $y_i=[y_{i,1},y_{i,2},\cdots,y_{i,t}]$ be the i-th sample in the dimension-reduced data matrix B; its weighted Euclidean distance to the j-th clustering center $c_j$ is calculated by formula (7), and every sample i is assigned to the clustering center $c_j$ with the smallest weighted Euclidean distance,
$$d_w(y_i,c_j)=\sqrt{\sum_{l=1}^{t}w_l\,(y_{i,l}-c_{j,l})^2} \qquad (7)$$
c) Updating the clustering centers. Let $n_j$ be the number of samples of class j obtained in step b) and $y_{i,j}$ be the i-th sample in the j-th class; the clustering center of every class is updated according to formula (8);
$$c_j=\frac{1}{n_j}\sum_{i=1}^{n_j}y_{i,j} \qquad (8)$$
d) Iterative calculation. Let the current iteration number be t; the squared error J(t) from all samples in matrix B to their clustering centers is calculated according to formula (9) and compared with the previous error J(t-1); if J(t) - J(t-1) < 0, jump back to step b), otherwise the algorithm ends and the corresponding clustering result is output.
$$J(t)=\sum_{j=1}^{k}\sum_{y_i\in C_j}d_w(y_i,c_j)^2 \qquad (9)$$
where $C_j$ denotes the set of samples currently assigned to the j-th cluster.
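The weighted k-means loop of step 5) can be sketched in Python as follows; the random initialization, the seed and the iteration cap are assumptions of this sketch, while the distance, update and stopping rules follow formulas (7) to (9) as read above.

```python
import numpy as np

def weighted_kmeans(B, W, k, n_iter=100, seed=0):
    """Cluster the reduced data B (N, t) with index weights W (t,) into k classes."""
    rng = np.random.default_rng(seed)
    N = B.shape[0]
    centers = B[rng.choice(N, size=k, replace=False)].copy()   # random initial centers
    labels = np.zeros(N, dtype=int)
    prev_J = np.inf
    for _ in range(n_iter):
        # weighted squared Euclidean distance of every sample to every center, formula (7)
        d2 = ((B[:, None, :] - centers[None, :, :]) ** 2 * W).sum(axis=2)
        labels = d2.argmin(axis=1)
        J = d2[np.arange(N), labels].sum()                      # squared error, formula (9)
        if J >= prev_J:                                         # error no longer decreases: stop
            break
        prev_J = J
        for j in range(k):                                      # center update, formula (8)
            members = B[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels, centers, prev_J
```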
As shown in fig. 4, the method for determining the optimal clustering number and the clustering result based on the Silhouette index in step 5) includes the steps of:
a) Set the maximum and minimum cluster numbers. Without loss of generality, set $k_{\max}=\lfloor\sqrt{N}\rfloor$ and $k_{\min}=2$, where N is the number of samples.
b) For each clustering number k with $k_{\min}\leq k\leq k_{\max}$, obtain the clustering result with the weighted k-means algorithm and calculate its Silhouette index. For example, if the preceding step gives $k_{\max}=10$ and $k_{\min}=2$, then k is varied over this range, i.e. the clustering results for cluster numbers between 2 and 10 are computed and the Silhouette index of each result is calculated.
c) Compare the Silhouette indexes under the different clustering numbers. The k value for which the Silhouette index is maximum is the optimal clustering number k, and the corresponding result is the optimal clustering result.
The relevant explanation for the above steps is as follows:
the method for defining the Silhouette index in the step b) specifically comprises the following steps:
Let the N daily load curves be divided into k classes. For the i-th sample in the j-th class, define $d_a(i)$ as the average distance between sample i and the other samples of its class, where a smaller value indicates a more compact class; define $d_b(i)$ as the minimum average distance from sample i to all samples of each of the other classes, where a larger value indicates better separation between classes. The Silhouette index $I_{Sil}(i)$ of sample i is defined by formula (10), and the Silhouette index $I_{Silhouette}$ of all samples by formula (11).
$$I_{Sil}(i)=\frac{d_b(i)-d_a(i)}{\max\{d_a(i),d_b(i)\}} \qquad (10)$$
$$I_{Silhouette}=\frac{1}{N}\sum_{i=1}^{N}I_{Sil}(i) \qquad (11)$$
In formula (11), the larger $I_{Silhouette}$ is, the better the clustering quality; a negative value indicates clustering failure; the k corresponding to the maximum $I_{Silhouette}$ is the optimal clustering number.
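Finally, a Python sketch of the Silhouette-based selection of the clustering number (formulas (10)-(11) together with the sweep of k from 2 to about the square root of N); it reuses the weighted_kmeans sketch above, and the handling of empty classes and ties is an assumption of this sketch.

```python
import numpy as np

def silhouette_index(B, labels):
    """Mean Silhouette value over all samples, formulas (10)-(11)."""
    dist = np.linalg.norm(B[:, None, :] - B[None, :, :], axis=2)   # pairwise distances
    s = np.zeros(len(B))
    for i in range(len(B)):
        same = labels == labels[i]
        same[i] = False
        d_a = dist[i, same].mean() if same.any() else 0.0          # intra-class average distance
        others = [c for c in np.unique(labels) if c != labels[i]]
        d_b = min(dist[i, labels == c].mean() for c in others) if others else 0.0
        denom = max(d_a, d_b)
        s[i] = (d_b - d_a) / denom if denom > 0 else 0.0
    return s.mean()

def best_clustering(B, W):
    """Sweep the cluster number and keep the result with the largest Silhouette index."""
    k_max = max(int(np.sqrt(len(B))), 2)
    results = {k: weighted_kmeans(B, W, k) for k in range(2, k_max + 1)}
    scores = {k: silhouette_index(B, labels) for k, (labels, _, _) in results.items()}
    k_best = max(scores, key=scores.get)
    return k_best, results[k_best], scores[k_best]
```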
In the daily load curve dimensionality reduction clustering method based on kernel principal component analysis, the preprocessed daily load data matrix is first mapped nonlinearly into a high-dimensional space by a Gaussian Radial Basis Function (RBF) kernel to obtain the kernel matrix K; the kernel matrix K is then eigendecomposed to obtain the corresponding eigenvalues and unitized eigenvectors, the extraction efficiency is set according to the descending trend of the eigenvalues, and the principal components and the number of dimension reduction indexes are obtained; next, the eigenvalues are taken as the dimension reduction index weights and normalized so that they sum to 1 to obtain the weight vector W, and the projection of the kernel matrix on the principal components is taken as the dimension-reduced data matrix B; finally, the dimension-reduced data matrix B and the weight vector W are used as the input of the weighted k-means algorithm to cluster the daily load curves, and the optimal clustering number and the clustering result are determined based on the Silhouette index.

Claims (7)

1. A daily load curve dimension reduction clustering method based on kernel principal component analysis is characterized by comprising the following steps:
step 1), preprocessing daily load power curve data: firstly, identifying and correcting data in a daily load power curve, identifying the data with the load power change rate exceeding a preset threshold value as abnormal data, correcting the curve with abnormal and missing data through a unitary three-point parabolic interpolation algorithm, then performing per-unit processing on the corrected daily load power curve data by taking the maximum power value of the daily load power curve as a reference value, and obtaining a normalized daily load power curve active power per-unit value matrix as an original data matrix;
step 2), based on the original data matrix, first selecting a Gaussian radial basis kernel function and using it to map the original data matrix nonlinearly into a high-dimensional feature space to obtain a kernel matrix, and then correcting the kernel matrix;
step 3), performing principal component analysis on the corrected kernel matrix and determining the principal components and the number of dimension reduction indexes: first performing eigenvalue decomposition to obtain the eigenvalues and eigenvectors, then obtaining the unitized eigenvectors by the Gram-Schmidt orthogonalization method and calculating the cumulative contribution rate corresponding to the eigenvalues; different fitted curves are then formed by fitting successively larger numbers of leading eigenvalues, the number of eigenvalues corresponding to the fitted curve enclosing the smallest area with the x and y axes is selected as the number of characteristic indexes and taken as the number of dimension reduction indexes, and the corresponding unitized eigenvectors are taken as the principal components;
step 4), taking a projection matrix obtained by projection of the corrected kernel matrix on the principal component components as a dimensionality reduction data matrix obtained after kernel principal component analysis dimensionality reduction, then taking a characteristic value corresponding to each principal component as a dimensionality reduction index weight, and simultaneously obtaining a corresponding dimensionality reduction index weight vector;
and step 5), combining the dimension-reduced data matrix and the dimension reduction index weight vector and clustering with a weighted k-means algorithm: firstly, randomly selecting a plurality of samples in the dimension-reduced data matrix as initial clustering centers; then, combining the dimension reduction indexes and the index weight vector, dividing each sample into the cluster of the initial clustering center with the smallest weighted Euclidean distance, and updating the clustering centers according to the samples in each resulting cluster; then calculating the squared error from each sample to its clustering center, returning to the above steps for a new iteration if the squared error is smaller than that obtained after the previous iteration, and otherwise finishing the clustering; after clustering is finished, calculating the Silhouette indexes of the clustering results under different clustering numbers, and taking the clustering number with the maximum Silhouette index value and its clustering result as the best.
2. The daily load curve dimension reduction clustering method based on kernel principal component analysis as claimed in claim 1, wherein said step 1) comprises the following processes:
step 1-1), identifying and correcting abnormal data in the daily load power curve:
firstly, abnormal data are identified: let $P_k=[p_{k,1},p_{k,2},\cdots,p_{k,m}]^T$ be the power values of a certain load curve at its sampling time points, and identify abnormal data according to formula (1):
[formula (1): load power change rate $\delta_{k,i}$ of the load curve at the i-th point; equation image not reproduced]
in the formula, $\delta_{k,i}$ is the load power change rate of the load curve at the i-th point; data whose change rate exceeds the preset threshold $\varepsilon$ are regarded as abnormal;
then, correcting, and if the data loss and abnormal quantity of a certain load curve are not less than 10%, directly deleting the load curve;
if the data missing amount and the abnormal amount of a certain load curve are lower than 10%, interpolation fitting is carried out through a unitary three-point parabolic interpolation algorithm to obtain the values of the abnormal amount and the missing amount again, and the unitary three-point parabolic interpolation algorithm comprises the following steps:
let the n nodes $x_i$ have function values $y_i=f(x_i)$, where $i=0,1,\cdots,n-1$, with $x_0<x_1<\cdots<x_{n-1}$ and corresponding function values $y_0<y_1<\cdots<y_{n-1}$; to calculate the approximate function value $z=f(t)$ at a specified interpolation point $t$, the 3 nodes closest to $t$ are selected, $x_{k-1}$, $x_k$, $x_{k+1}$, where $x_k<t<x_{k+1}$, and the value of $z$ is then calculated according to the parabolic interpolation formula (2), i.e.
$$z=\sum_{i=m}^{m+2}\Big(\prod_{j=m,\,j\neq i}^{m+2}\frac{t-x_j}{x_i-x_j}\Big)\,y_i \qquad (2)$$
in the formula, $m=k-1$ when $|x_k-t|<|t-x_{k+1}|$, and $m=k$ when $|x_k-t|>|t-x_{k+1}|$;
if the interpolation point t is not in the interval containing n nodes, only 2 nodes at one end of the interval are selected for linear interpolation;
step 1-2), per unit processing is carried out on the corrected daily load power curve data:
let $P_k=[p_{k1},\ldots,p_{ki},\ldots,p_{km}]\in R^{1\times m}$ be the m-point original active power matrix of the corrected k-th daily load power curve, where $k=1,2,\ldots,N$, N is the total number of daily load power curves, $p_{ki}$ is the original active power at the i-th point of the k-th daily load power curve, $i=1,2,\ldots,m$, and m is the number of sampling points; then $P=[P_1,\ldots,P_k,\ldots,P_N]^T\in R^{N\times m}$ is the m-point original active power matrix of the N daily load power curves, where $R^{1\times m}$ denotes a real matrix with 1 row and m columns and $R^{N\times m}$ a real matrix with N rows and m columns;
taking the maximum power value of the daily load power curve, $p_{k\cdot\max}=\max\{p_{k1},p_{k2},\ldots,p_{ki},\ldots,p_{km}\}$, as the reference value, the original data samples are converted to per-unit values according to formula (3),
$$p'_{ki}=p_{ki}/p_{k\cdot\max} \qquad (3)$$
obtaining the normalized active power per-unit value matrix of each daily load power curve:
$P'_k=[p'_{k1},p'_{k2},\ldots,p'_{ki},\ldots,p'_{km}]\in R^{1\times m}$; the resulting matrix is denoted $A\in R^{N\times m}$, where N is the number of daily load curves.
3. The daily load curve dimension reduction clustering method based on kernel principal component analysis as claimed in claim 1, wherein said step 2) comprises the following processes:
step 2-1), carrying out nonlinear mapping on the original data matrix A to a high-dimensional space to obtain a kernel matrix K:
let vectors $x_i\in R^{1\times m}$ and $y_j\in R^{1\times m}$ be the per-unit data of any two daily load curves in the original data matrix A, where $i\neq j$; the matrix A is mapped nonlinearly by the Gaussian radial basis kernel function according to formula (4) to obtain the kernel matrix K;
$$K(x_i,y_j)=\exp\!\Big(-\frac{\lVert x_i-y_j\rVert^2}{2\sigma^2}\Big) \qquad (4)$$
wherein $i=1,2,\cdots,N$, $j=1,2,\cdots,N$, $i\neq j$, and $\sigma$ is the width parameter of the Gaussian kernel;
step 2-2), correcting the kernel matrix K:
the kernel matrix K is corrected according to the following formula (5):
$$K'=K-l_N K-K\,l_N+l_N K\,l_N \qquad (5)$$
in formula (5), $l_N$ is the $N\times N$ matrix $\frac{1}{N}(l_{i,j})$ with $l_{i,j}=1$ for all $i,j$, i.e. every element of $l_N$ equals $1/N$; m is the daily load curve dimension, i.e. the number of sampling points, and N is the number of daily load curves.
4. The daily load curve dimension reduction clustering method based on kernel principal component analysis as claimed in claim 1, wherein said step 3) comprises the following processes:
the method for performing principal component analysis on the corrected kernel matrix K comprises the following steps:
a) Performing eigenvalue decomposition on the corrected kernel matrix K to obtain the eigenvalues $\lambda_1,\lambda_2,\cdots,\lambda_N$ and the corresponding eigenvectors $v_1,v_2,\cdots,v_N$;
b) Sorting the eigenvalues in descending order and reordering the corresponding eigenvectors accordingly;
c) Unitizing the eigenvectors by the Gram-Schmidt orthogonalization method to obtain $\alpha_1,\alpha_2,\cdots,\alpha_N$.
The method for determining the principal component components and the number of the dimensionality reduction indexes comprises the following steps:
calculating the cumulative contribution rates $B_1,B_2,\cdots,B_N$ corresponding to the eigenvalues $\lambda_1,\lambda_2,\cdots,\lambda_N$ and setting the extraction efficiency $\eta$; if the cumulative contribution rate corresponding to the t-th eigenvalue is not smaller than the set extraction efficiency, i.e. $B_t\geq\eta$, then t is taken as the number of dimension reduction indexes, and the unitized eigenvectors $\alpha_1,\alpha_2,\cdots,\alpha_t$ corresponding to $\lambda_1,\lambda_2,\cdots,\lambda_t$ are the principal components; the extraction efficiency $\eta$ is set as follows:
a) Set an initial value of $\eta$ such that t = 3, fit a curve y = ax + b to the first three eigenvalues $\lambda_1,\lambda_2,\lambda_3$, and calculate the area S1 of the figure enclosed by the fitted curve and the x and y axes;
b) Change the value of $\eta$ so that t = 4, 5, 6, ..., and calculate the areas S2, S3, S4, ... of the figures enclosed by the corresponding fitted curves and the x and y axes;
c) Compare the enclosed areas for the different values of t; the $\eta$ corresponding to the minimum area is the extraction efficiency, and the t satisfying $B_t\geq\eta$ at that point is the finally determined number of characteristic indexes.
5. The daily load curve dimension reduction clustering method based on kernel principal component analysis as claimed in claim 1, wherein said step 4) comprises the following processes:
the method for obtaining the dimension reduction data matrix B comprises the following steps:
calculating the projection $Y=K\cdot\alpha$ of the corrected kernel matrix K on the principal components, where $\alpha=(\alpha_1,\alpha_2,\cdots,\alpha_t)$; the obtained projection matrix Y is the dimension-reduced data matrix B obtained after the data are reduced in dimension by kernel principal component analysis;
the method for determining the weight vector of the dimensionality reduction index comprises the following steps:
let the eigenvalues corresponding to the principal components be $\lambda_1,\lambda_2,\cdots,\lambda_t$; the weight of each eigenvalue is determined according to formula (6),
$$w_i=\frac{\lambda_i}{\sum_{j=1}^{t}\lambda_j} \qquad (6)$$
the weight coefficient $w_i$ of each index calculated from formula (6) determines the dimension reduction index weight vector $W=[w_1,w_2,\cdots,w_t]$.
6. The daily load curve dimension reduction clustering method based on kernel principal component analysis as claimed in claim 1, wherein said step 5) comprises the following processes:
the clustering step by the weighted k-means algorithm is as follows:
a) Initialization: the initialization comprises initialization of an initial clustering number k, an iteration number n and an initial clustering center;
b) Sample classification: k samples are selected at random from the dimension-reduced data matrix B as the initial clustering centers $c_j=[c_{j,1},c_{j,2},\cdots,c_{j,t}]$, where $j=1,2,\cdots,k$;
let $y_i=[y_{i,1},y_{i,2},\cdots,y_{i,t}]$ be the i-th sample in the dimension-reduced data matrix B; its weighted Euclidean distance to the j-th clustering center $c_j$ is calculated according to formula (7), and every sample i is assigned to the clustering center $c_j$ with the smallest weighted Euclidean distance,
$$d_w(y_i,c_j)=\sqrt{\sum_{l=1}^{t}w_l\,(y_{i,l}-c_{j,l})^2} \qquad (7)$$
c) Updating the clustering centers: let $n_j$ be the number of samples of class j obtained in step b) and $y_{i,j}$ be the i-th sample in the j-th class; the clustering center of every class is updated according to formula (8),
$$c_j=\frac{1}{n_j}\sum_{i=1}^{n_j}y_{i,j}, \quad j=1,2,\cdots,k \qquad (8)$$
d) Iterative calculation: let the current iteration number be t; the squared error J(t) from all samples in matrix B to their clustering centers is calculated according to formula (9) and compared with the previous error J(t-1); if J(t) - J(t-1) < 0, jump to step b), otherwise the algorithm ends and the corresponding clustering result is output;
$$J(t)=\sum_{j=1}^{k}\sum_{y_i\in C_j}d_w(y_i,c_j)^2 \qquad (9)$$
where $C_j$ denotes the set of samples currently assigned to the j-th cluster;
the method for determining the optimal clustering number and the clustering result based on the Silhouette index comprises the following steps:
e) Setting the maximum cluster number $k_{\max}=\lfloor\sqrt{N}\rfloor$ and the minimum cluster number $k_{\min}=2$, where N is the number of samples;
f) For each clustering number k with $k_{\min}\leq k\leq k_{\max}$, obtaining the clustering result by the weighted k-means algorithm and then calculating the Silhouette index of each clustering result;
g) Comparing the Silhouette indexes under the different clustering numbers; the k value corresponding to the maximum Silhouette index is the optimal clustering number k, and the corresponding result is the optimal clustering result.
7. The daily load curve dimension reduction clustering method based on kernel principal component analysis according to claim 6, wherein the method for defining the Silhouette index in the step f) is as follows:
let the N daily load curves be divided into k classes; for the i-th sample in the j-th class, define $d_a(i)$ as the average distance between sample i and the other samples of its class, where a smaller value indicates a more compact class; define $d_b(i)$ as the minimum average distance from sample i to all samples of each of the other classes, where a larger value indicates better separation between classes; the Silhouette index $I_{Sil}(i)$ of sample i is defined by formula (10), and the Silhouette index $I_{Silhouette}$ of all samples by formula (11):
$$I_{Sil}(i)=\frac{d_b(i)-d_a(i)}{\max\{d_a(i),d_b(i)\}} \qquad (10)$$
$$I_{Silhouette}=\frac{1}{N}\sum_{i=1}^{N}I_{Sil}(i) \qquad (11)$$
in formula (11), the larger $I_{Silhouette}$ is, the better the clustering quality; a negative value indicates clustering failure; the k corresponding to the maximum $I_{Silhouette}$ is the optimal clustering number, and the clustering result corresponding to that k is the optimal clustering result.
CN201811302328.3A 2018-11-02 Active CN109871860B, Daily load curve dimension reduction clustering method based on kernel principal component analysis, Hunan University.