CN111723876B - Load curve integrated spectrum clustering method considering double-scale similarity - Google Patents

Load curve integrated spectrum clustering method considering double-scale similarity

Info

Publication number
CN111723876B
CN111723876B (application CN202010699981.9A)
Authority
CN
China
Prior art keywords
clustering
load
similarity
distance
dbi
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010699981.9A
Other languages
Chinese (zh)
Other versions
CN111723876A (en)
Inventor
万灿
徐胜蓝
于建成
曹照静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202010699981.9A priority Critical patent/CN111723876B/en
Publication of CN111723876A publication Critical patent/CN111723876A/en
Application granted granted Critical
Publication of CN111723876B publication Critical patent/CN111723876B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06Energy or water supply

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Economics (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Public Health (AREA)
  • Water Supply & Treatment (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a load curve integrated spectral clustering method considering double-scale similarity. The similarity of load morphological change is calculated through the cosine distance of the load difference vectors, and a double-scale similarity measure is constructed for measuring load similarity. Clustering performance improvement and effective combination of the two measures are realized through ensemble clustering: spectral clustering is used as the method for generating base clustering models, and the diversity of the base clusterings is ensured by selecting different similarity measures, setting different cluster numbers and running randomly; a weighted consistency matrix together with spectral clustering is used as the cluster-ensemble strategy; the Davies-Bouldin index (DBI) or the new index MDBI is used as the clustering evaluation index in the ensemble process, the reciprocal of the DBI or MDBI serves as the basis for adaptive weight setting when computing the consistency matrix, and the final ensemble partition is obtained by spectral clustering. The method has excellent clustering validity and robustness, and avoids the drawback that a single spectral clustering method requires parameter tuning for each different data set.

Description

Load curve integrated spectrum clustering method considering double-scale similarity
Technical Field
The invention relates to a load curve integrated spectral clustering method considering double-scale similarity, and belongs to the field of power system load characteristic analysis.
Background
Against the background of the urban energy Internet, the increasingly complete electricity consumption information acquisition systems and the dispatching, operation-and-maintenance and marketing business systems are driving a rapid accumulation of power data resources. Valuable information such as energy-use characteristics is hidden in these power data, and data analysis techniques are needed to mine it. Clustering, as an unsupervised learning technique, is well suited to classifying unlabeled load curves. It provides power enterprises with classification results based on differences in load characteristics, helps them accurately grasp users' energy-use behavior patterns, and offers strong support for applications such as demand-side response, load forecasting and abnormal electricity-use detection.
Load characteristic analysis by clustering already has a solid research foundation. Research on load clustering mainly focuses on the following three aspects: 1) load clustering methods: the validity of the clustering result is the key to its application value, and how to design a suitable load clustering method to improve clustering quality is one of the research hotspots; 2) load similarity measures: choosing a reasonable distance measure according to the purpose of load clustering, so that the similarity of different users' load characteristics is measured properly, makes the clustering result more accurate and effective; 3) load data feature extraction: extracting low-dimensional features that effectively reflect differences in load characteristics from high-dimensional load curve data can improve both the quality and the efficiency of load clustering.
The Euclidean distance is a classical similarity measure in load clustering. In cluster analysis of substation load characteristics, substation load characteristics have been represented by substation user composition and the Euclidean distance between curves. Some studies apply the clustering-by-fast-search-and-find-of-density-peaks method to load clustering and introduce histogram equalization to improve the clustering effect. Others combine partitional clustering and hierarchical clustering into a two-layer clustering scheme so that the two methods complement each other and the validity of load clustering is optimized. These load clustering studies adopt the Euclidean distance as the basis for measuring load similarity, but the Euclidean distance focuses on the distance between curves and is limited in mining the similarity of load curve shape changes. To improve the similarity measure, some studies measure load similarity from both distance and morphological characteristics and classify loads with spectral clustering; others introduce dynamic time warping distance and cross-correlation methods to improve the computation of morphological similarity between load sequences; still others introduce the concept of a typical time warping distance and combine it with a Gaussian kernel function to characterize load-sequence similarity on spatio-temporal dual scales, improving the spectral multi-manifold clustering method. At present, load clustering research mostly adopts a single clustering method or a two-layer clustering method, but such methods generally have limitations, for example parameters must be re-tuned for different data sets, adaptability to different data structures varies, and multiple parameters must be tuned, and these problems can adversely affect load clustering quality.
Disclosure of Invention
To address the problems summarized in the background section, the invention provides a load curve integrated spectral clustering method considering double-scale similarity.
In order to achieve the purpose, the invention adopts the following technical scheme:
a load curve integrated spectral clustering method considering double-scale similarity is characterized in that a double-scale similarity measurement mode is constructed by combining difference cosine distance and Euclidean distance, a differential basis clustering model is constructed by adopting a spectral clustering method based on double-scale similarity measurement, and clustering integration is realized by using a consistency matrix based on self-adaptive weighting of intra-cluster evaluation indexes and spectral clustering.
The specific method comprises the following steps:
firstly, the similarity of load morphological change is calculated through the cosine distance of the load difference vectors, and a double-scale similarity measure for load similarity is constructed to make up for the deficiency of the Euclidean distance in measuring load-characteristic similarity; then, spectral clustering is taken as the method for generating base clustering models, and differentiated base clustering models are constructed by selecting different similarity measures, setting different cluster numbers and running randomly, which ensures the diversity of the base clustering models; finally, the weighted consistency matrix and spectral clustering are taken as the cluster-ensemble strategy, the Davies-Bouldin index (DBI) or the new index MDBI is taken as the clustering evaluation index in the ensemble process, the reciprocal of the DBI or MDBI is taken as the basis for adaptive weight setting when computing the weighted consistency matrix, the final ensemble partition is then obtained by spectral clustering, and the improvement of clustering performance and the effective combination of the two measures are realized through ensemble clustering.
In the above technical solution, further, in order to avoid the influence of load curve amplitude differences on the load-shape similarity calculation, the load curves are normalized by the maximum value before the method is run. The specific method is as follows:
Different users consume different amounts of electricity, so their daily load curves differ in amplitude, sometimes greatly; however, load clustering is based on the similarity of load shapes, and the curve amplitude is irrelevant to the similarity calculation. To avoid the influence of amplitude differences on the similarity result, the load curves are normalized first.
Assume that the load data set contains m load curves, each of dimension n, and that all load samples are to be divided into k clusters. The load data are processed by maximum-value normalization, defined as follows:
x_ij = x̂_ij / x̂_i,max

where x_ij is the normalized value of the j-th dimension of the i-th load curve, x̂_ij is the j-th dimension of the raw data of the i-th load curve, and x̂_i,max is the maximum value over all dimensions of the raw data of the i-th load curve.
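As an illustrative, non-limiting sketch, this normalization step can be expressed in Python as below, assuming the m×n load data are held in a NumPy array; the function name is illustrative and not part of the invention.

```python
import numpy as np

def max_normalize(X_raw: np.ndarray) -> np.ndarray:
    """Divide each load curve by its own maximum so that curve amplitude is removed.

    X_raw has shape (m, n): m load curves sampled at n points; strictly positive
    row maxima are assumed (an all-zero curve would need special handling).
    """
    row_max = X_raw.max(axis=1, keepdims=True)  # x̂_i,max: maximum over all n dimensions
    return X_raw / row_max                      # x_ij = x̂_ij / x̂_i,max
```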
Furthermore, a first-order difference operation is performed on the normalized load curve data, and the cosine distance of the first-order load difference vectors, namely the difference cosine distance, is calculated to reflect the consistency of the morphological changes of two load curves. The specific method is as follows:
Applying a first-order difference to the normalized load curve data extracts power-change vectors that reflect the morphological change characteristics of each load curve, such as rising, falling and staying stable. The cosine distance is derived from the cosine similarity, which measures the similarity of two vectors through the cosine of the angle between them in vector space. The cosine distance represents the relative difference in vector direction, so the cosine distance of the first-order load difference vectors can be used to reflect the consistency of the shape changes of two load curves. The value range of the difference cosine distance is [0, 2]; the smaller the value, the more similar the shape changes of the two load curves.
The first-order difference operation on the load is defined as:

Δx_ij = x_i,j+1 − x_ij,  j = 1, 2, …, n−1

where Δx_ij denotes the j-th dimension of the i-th load difference vector obtained by the first-order difference operation.
The differential cosine distance of the load is defined as:
dc_ii′ = 1 − c_ii′

c_ii′ = (Δx_i · Δx_i′) / (‖Δx_i‖_2 · ‖Δx_i′‖_2)

where dc_ii′ is the cosine distance between the i-th and i′-th load difference vectors, i.e. the difference cosine distance of the i-th and i′-th load curves; c_ii′ is the cosine similarity of the i-th and i′-th load difference vectors; Δx_i is the i-th load difference vector; ‖Δx_i‖_2 is its 2-norm. In the second expression the product in the numerator is a vector dot product and the product in the denominator is a scalar multiplication.
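A hedged sketch of the two formulas above, computing the pairwise difference cosine distance for all normalized curves at once; names and the vectorized layout are illustrative assumptions.

```python
import numpy as np

def diff_cosine_distance(X: np.ndarray) -> np.ndarray:
    """Pairwise difference cosine distance dc_ii' in [0, 2] for normalized load curves X of shape (m, n)."""
    D = np.diff(X, axis=1)                                  # first-order difference vectors Δx_i, shape (m, n-1)
    D_unit = D / np.linalg.norm(D, axis=1, keepdims=True)   # divide each Δx_i by its 2-norm
    cos_sim = D_unit @ D_unit.T                             # cosine similarity c_ii'
    return 1.0 - cos_sim                                    # dc_ii' = 1 - c_ii'
```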
Further, a composite load-curve distance based on double-scale similarity is constructed by combining the difference cosine distance with the Euclidean distance; it accounts for both the distance between loads and the similarity of their shape changes, and can be obtained by a linear combination. The specific method is as follows:
the composite distance is defined as:
ds_ii′ = a_e·de_ii′ + a_c·dc_ii′·r

where ds_ii′ is the composite distance of the i-th and i′-th load curves; de_ii′ is their Euclidean distance; dc_ii′ is their difference cosine distance; a_e and a_c are the weight coefficients of the Euclidean distance and the difference cosine distance in the composite distance. Since both similarities are considered effective measures, a_e and a_c are both set to 0.5. Because the value ranges of the Euclidean distance and the difference cosine distance differ, the difference cosine distance is amplified by a factor r, where r is a proportionality coefficient.
Since the lower bounds of the difference cosine distance and the Euclidean distance are both 0 while their upper bounds differ, the proportionality coefficient r is calculated as:

r = de_max / dc_max

where de_max and dc_max are, respectively, the maximum Euclidean distance and the maximum difference cosine distance over all load curves in the data set.
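The composite distance can then be assembled as in the sketch below, which reuses diff_cosine_distance from the previous snippet and assumes the equal weights a_e = a_c = 0.5 stated above.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def composite_distance(X: np.ndarray, a_e: float = 0.5, a_c: float = 0.5) -> np.ndarray:
    """Double-scale composite distance ds = a_e*de + a_c*dc*r over normalized curves X."""
    de = squareform(pdist(X, metric="euclidean"))  # Euclidean distance matrix de_ii'
    dc = diff_cosine_distance(X)                   # difference cosine distance matrix dc_ii'
    r = de.max() / dc.max()                        # proportionality coefficient r = de_max / dc_max
    return a_e * de + a_c * dc * r
```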
The Davies-Bouldin index (DBI) and the Adjusted Rand Index (ARI) are selected as the internal and external evaluation indexes of the clustering effect of the method. Considering that the classic DBI formula uses the Euclidean distance to measure the distance between data samples and therefore cannot accurately evaluate the validity of clustering results obtained with other similarity measures, the composite distance is applied to the distance calculation of the DBI. The specific method is as follows:
The Davies-Bouldin Index (DBI), proposed by Davies and Bouldin to evaluate clustering validity, is also called a classification accuracy index. The DBI jointly considers the similarity of samples within clusters and the separation of samples between clusters; the smaller its value, the higher the clustering validity. It is defined as:

DBI = (1/k) Σ_{i=1,…,k} max_{j≠i} [ (de_i + de_j) / de(C_i, C_j) ]

where de_i is the mean Euclidean distance from the samples of class i to its class center, de_j is the corresponding quantity for class j, and de(C_i, C_j) is the Euclidean distance between the class centers of classes i and j.
The Adjusted Rand Index (ARI) is a common external evaluation index for clustering; it evaluates clustering validity by counting the pairs of samples that are assigned to the same or to different clusters in the true labels and in the clustering result. It is defined as:

RI = (TP + TN) / C(m, 2)

ARI = (RI − E(RI)) / (max(RI) − E(RI))

where RI is the Rand index; TP is the number of sample pairs that belong to the same class in the true labels and are also placed in the same cluster in the clustering result; TN is the number of sample pairs that belong to different classes in the true labels and are also placed in different clusters in the clustering result; C(m, 2) is the number of ways to choose two samples from the m load samples; E(RI) is the expected value of RI, and max(RI) is the maximum value of RI. The ARI ranges over [−1, 1]; the larger the value, the closer the clustering result is to the true partition, and ARI = 1 indicates that the clustering result coincides with the true labels.
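Where true labels are available, the ARI need not be computed by hand from the RI formula; scikit-learn's implementation can be used as a check. The labels below are purely illustrative.

```python
from sklearn.metrics import adjusted_rand_score

y_true = [0, 0, 1, 1, 2, 2]   # illustrative true labels
y_pred = [1, 1, 0, 0, 2, 2]   # illustrative clustering result (same partition, labels permuted)
print(adjusted_rand_score(y_true, y_pred))  # 1.0: ARI is invariant to relabelling of clusters
```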
Since the Davies-Bouldin index (DBI) and the Adjusted Rand Index (ARI) are selected as the internal and external evaluation indexes, and the classic DBI formula uses the Euclidean distance to measure the distance between data samples and therefore cannot accurately evaluate the validity of clustering results obtained with other similarity measures, the composite distance is applied to the distance calculation of the DBI to construct a new index (Modified DBI, MDBI), namely:

MDBI = (1/k) Σ_{i=1,…,k} max_{j≠i} [ (ds_i + ds_j) / ds(C_i, C_j) ]

where MDBI is the new index used to evaluate the validity of load clustering results considering double-scale similarity; ds_i is the average composite distance from the samples of class i to its class center (and ds_j the corresponding quantity for class j); ds(C_i, C_j) is the composite distance between the class centers of classes i and j.
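A sketch of the MDBI (and, when given a Euclidean distance matrix, the classic DBI) computed from a precomputed pairwise distance matrix. Since only pairwise distances are assumed available here, class centers are approximated by medoids; that approximation is an implementation assumption of this sketch, not part of the index definition.

```python
import numpy as np

def modified_dbi(dist: np.ndarray, labels: np.ndarray) -> float:
    """DBI-style validity index on an arbitrary precomputed distance matrix (smaller is better)."""
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    centers, scatter = [], []
    for c in clusters:
        idx = np.where(labels == c)[0]
        sub = dist[np.ix_(idx, idx)]
        medoid = idx[np.argmin(sub.sum(axis=1))]   # medoid used as the class center C_i
        centers.append(medoid)
        scatter.append(dist[medoid, idx].mean())   # ds_i: mean distance of class samples to the center
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max((scatter[i] + scatter[j]) / dist[centers[i], centers[j]]
                     for j in range(k) if j != i)
    return total / k
```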
Further, the construction method of the differentiated base clustering model comprises the following steps:
the spectral clustering method is evolved from graph theory, a data sample is regarded as a distribution point in space, the points are connected by edges with weights, and the weight values of the edges are in direct proportion to the similarity between the points of the data sample. The method is characterized in that a nondirectional weight graph formed by a space inner point and a weighted edge is subjected to graph cutting by spectral clustering, and the main aim is to enable the weight values of the edges among different subgraphs to be as low as possible after graph cutting and enable the weight values of the edges in the subgraphs to be as high as possible. The spectral clustering performance is excellent, and the adaptability to data distribution is strong.
In spectral clustering, the edge weight of an undirected graph is represented by a similarity matrix, and in most spectral clustering methods, a Gaussian kernel function is adopted to calculate the similarity matrix, namely:
s_ii′ = exp( −d_ii′ / (2σ²) )

where s_ii′ is the element in row i and column i′ of the similarity matrix, i.e. the weight of the edge between the i-th and i′-th data sample points; d_ii′ is the distance between the i-th and i′-th load curves; σ is the scale parameter of the kernel function.
In the spectral clustering method, the similarity measure between different load curves enters mainly through d_ii′ in the similarity matrix. In general d_ii′ is taken as the squared Euclidean distance, in which case spectral clustering optimizes within-class and between-class squared Euclidean distances when partitioning the load clusters. For a spectral clustering method that measures load similarity by the difference cosine distance, the similarity matrix is computed with the difference cosine distance in place of the squared Euclidean distance and is defined as:

s_ii′ = exp( −dc_ii′ / (2σ²) )
the base clustering result can be generated by adopting different clustering methods, setting different cluster numbers, randomly running for multiple times and the like. Selecting spectral clustering as a base clustering method, taking a fixed value for a scale parameter (an empirical value taken from an experimental result, specifically, evaluating the quality of a result according to an evaluation index, and then selecting a scale parameter which is good in performance in a plurality of data sets as a fixed value according to the result), and ensuring the diversity of a base clustering model through the following three aspects: 1) the similarity measurement mode adopts Euclidean distance or difference cosine distance; 2) setting different cluster numbers with the value range of [ k ]min,kmax]Each of which is an integer; 3) the method for setting each pair of parameter combinations of the first two parameters randomly runs for many times, and the number of times is p. In an undirected graph segmentation mode of spectral clustering, an NCut graph segmentation method is selected to process an undirected weight graph obtained by a similar matrix, and a feature matrix obtained after a dimension reduction in the graph segmentation process is clustered by k-means.
Further, the method for integrating the base clustering model by adopting the weighted consistency matrix method comprises the following steps:
the consistency matrix method is a widely used classical clustering integration strategy, which converts a base clustering model into an m × m consistency matrix by calculating the probability that different samples are divided into the same type of clusters in all the base clustering models:
Figure GDA0003212852940000082
in the formula, conijThe value of the ith row and the ith' column of the consistency matrix; b represents the number of the base clustering models; i { } is an indicator function, and when the formula is established in brackets, the value is 1, otherwise, the value is 0; l isb(i) A class cluster label representing the ith sample in the b-th basis clustering model.
If the validity of the individual base clustering models is not taken into account and they are simply integrated, base models of low validity in the ensemble will adversely affect the performance of the integrated clustering method. Therefore, the clustering evaluation indexes of the different base clustering models are combined, their clustering validity is taken into account in the consistency-matrix calculation to set adaptive weights, and the influence of different base clustering models on the ensemble is adjusted accordingly.
When only the distance difference between curves is considered and clustering performance is optimized through ensembling, the DBI can be used to compute the base-clustering weights; when both the distance and the shape-change difference of the load curves are considered, the MDBI can be used. Since smaller DBI and MDBI values indicate higher clustering validity, the weight of each base clustering model is the reciprocal of its DBI or MDBI. The weighted consistency matrix is defined as follows:

con_ii′ = Σ_{b=1,…,B} w_b · I{ L_b(i) = L_b(i′) }

w_b = (1/in_b) / Σ_{b′=1,…,B} (1/in_b′)

where w_b is the weight of the b-th base clustering model in the weighted consistency matrix; in_b is the clustering evaluation index of the b-th base clustering model, which can be the DBI or the MDBI. The second equation rescales the base-model weights so that they sum to 1, which keeps the elements of the weighted consistency matrix in the range [0, 1].
The weighted consistency matrix can be regarded as a similarity matrix reflecting sample similarity and is processed by spectral clustering. As in the base clustering algorithm, the spectral clustering in the ensemble step also uses the NCut graph-cut and clusters the feature matrix with k-means.
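A sketch of the adaptive weighting and final cut, assuming the base labelings and their DBI/MDBI values are available from the previous steps; the weighted consistency matrix is fed back into spectral clustering as a precomputed affinity. All names are illustrative.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def weighted_consensus(base_labels, base_indices):
    """Weighted consistency matrix: weights are reciprocals of DBI/MDBI, rescaled to sum to 1."""
    w = 1.0 / np.asarray(base_indices, dtype=float)
    w /= w.sum()
    m = len(base_labels[0])
    con = np.zeros((m, m))
    for wb, lab in zip(w, base_labels):
        lab = np.asarray(lab)
        con += wb * (lab[:, None] == lab[None, :])   # w_b * I{L_b(i) == L_b(i')}
    return con

def ensemble_partition(con, k):
    """Cut the weighted consistency matrix (treated as a similarity matrix) with spectral clustering."""
    sc = SpectralClustering(n_clusters=k, affinity="precomputed", assign_labels="kmeans")
    return sc.fit_predict(con)
```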
The invention has the beneficial effects that:
The load curve integrated spectral clustering method considering double-scale similarity improves the spectral clustering algorithm through the idea of ensemble learning and improves the cluster quality of load clustering; its clustering validity is excellent, the integrated spectral clustering performs more stably across different data sets with excellent robustness, and it overcomes the drawback that a single spectral clustering algorithm must re-tune its scale parameter for each data set. By integrating differentiated base clustering models, the integrated spectral clustering method effectively combines the Euclidean distance and the difference cosine distance, considers the double-scale similarity of loads, and can mine the load shape-change information reflecting energy-consumption patterns more accurately and effectively. The validity and robustness of load clustering are further optimized through the effective weighting of base clusterings in the ensemble process.
Drawings
FIG. 1 is a frame diagram of a load curve integrated spectral clustering method considering two-scale similarity;
FIG. 2 is a schematic diagram of the load curve integrated spectral clustering result considering double-scale similarity for the autumn load data set;
FIG. 3 is a schematic diagram of a data set D1;
fig. 4 is a schematic diagram of a data set D2.
Detailed Description
The invention is further described with reference to the accompanying drawings and examples.
The framework of the load curve integrated spectral clustering method considering the two-scale similarity is shown in FIG. 1.
(1) Firstly, processing load data by adopting a maximum value normalization method, wherein the definition is shown as the following formula:
x_ij = x̂_ij / x̂_i,max

where x_ij is the normalized value of the j-th dimension of the i-th load curve, x̂_ij is the j-th dimension of the raw data of the i-th load curve, and x̂_i,max is the maximum value over all dimensions of the raw data of the i-th load curve.
And performing first-order difference operation on the normalized load curve data, and calculating the cosine distance of the first-order difference vector of the load, namely the difference cosine distance, so as to reflect the consistency of the morphological changes of the two load curves. And then, a load curve comprehensive distance based on double-scale similarity is constructed by combining the difference cosine distance and the Euclidean distance, the load distance and the similarity degree of morphological change are considered, and the comprehensive distance can be obtained through a linear function.
(2) Selecting spectral clustering as a base clustering method, taking a fixed value for a scale parameter of the spectral clustering, selecting an NCut graph cutting method to process a undirected weight graph obtained by a similar matrix in an undirected graph cutting mode of the spectral clustering, and selecting k-means to cluster a feature matrix obtained after dimension reduction in the graph cutting process.
The diversity of the base clustering model is ensured by the following three aspects:
1) The similarity measure is either the Euclidean distance or the difference cosine distance; the similarity matrix is calculated with a Gaussian kernel function.
2) Different cluster numbers are set, taking every integer in the range [k_min, k_max].
3) Each combination of the two parameters above is run randomly several times, p times in total.
(3) And calculating a weighted consistency matrix by combining the clustering evaluation indexes of different base clustering models.
When only the distance difference of the curves is considered and the clustering performance is optimized through integrated clustering, the DBI can be adopted to calculate the weight of the base clustering model; and when the distance of the load curve and the form change difference are comprehensively considered, the MDBI can be adopted to calculate the base clustering weight. And the weight value of the base clustering model is the reciprocal of the DBI or the MDBI of the corresponding base clustering model. The weighted consistency matrix is defined as follows:
con_ii′ = Σ_{b=1,…,B} w_b · I{ L_b(i) = L_b(i′) }

w_b = (1/in_b) / Σ_{b′=1,…,B} (1/in_b′)

where w_b is the weight of the b-th base clustering model in the weighted consistency matrix; in_b is the clustering evaluation index of the b-th base clustering model, which can be the DBI or the MDBI. The second equation rescales the base-model weights so that they sum to 1.
(4) The similarity matrix is processed by spectral clustering. As in the base clustering algorithm, the spectral clustering in the ensemble step also uses the NCut graph-cut and clusters the feature matrix with k-means.
(5) The integrated spectral clustering model is evaluated with clustering evaluation indexes, including the internal indexes DBI and MDBI and the external index ARI. The number of clusters is selected by taking the index-optimal result.
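As an illustrative end-to-end sketch under the assumptions above, the steps (1)-(5) can be chained together by reusing the helper functions sketched earlier (max_normalize, diff_cosine_distance, composite_distance, generate_base_clusterings, modified_dbi, weighted_consensus, ensemble_partition); all names are illustrative, and the cluster number is chosen here by minimizing the MDBI of the final partition.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def ensemble_spectral_clustering(X_raw, k_min=3, k_max=9, p=5):
    X = max_normalize(X_raw)
    de = squareform(pdist(X, metric="euclidean"))
    dc = diff_cosine_distance(X)
    ds = composite_distance(X)                                   # double-scale composite distance
    base = generate_base_clusterings(de, dc, k_min, k_max, p)    # diversified base models
    indices = [modified_dbi(ds, lab) for lab in base]            # MDBI of each base model
    con = weighted_consensus(base, indices)                      # adaptive weights = 1/MDBI
    # pick the cluster number whose final partition minimizes the MDBI
    candidates = {k: ensemble_partition(con, k) for k in range(k_min, k_max + 1)}
    best_k = min(candidates, key=lambda k: modified_dbi(ds, candidates[k]))
    return best_k, candidates[best_k]
```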
A case study is constructed using measured one-day user load data for the four seasons from a city in southern China, with a sampling interval of 15 min. After data preprocessing the data set contains 1565 users and covers industrial, commercial, residential and other load types.
(1) Integrated spectral clustering and internal evaluation index verification considering only distance difference
The performance of several classes of load clustering algorithms is compared with that of the integrated spectral clustering algorithm considering DBI weighting on the four seasonal load data sets. The comparison algorithms are: 1) the k-means algorithm measuring similarity by Euclidean distance, abbreviated kmeu; 2) the spectral clustering algorithm measuring similarity by Euclidean distance with a fixed scale parameter, abbreviated speu; 3) the spectral clustering algorithm measuring similarity by Euclidean distance with an optimized scale parameter, abbreviated speu-gamma; 4) the two-layer clustering algorithm, abbreviated km-ag; 5) the integrated spectral clustering algorithm without index weighting, abbreviated ESC-1.
The specific algorithm parameter settings are shown in Table 1. The cluster number is taken as every integer in [k_min, k_max]. Considering that too small a cluster number makes the clusters meaningless, the minimum value k_min is set to 3 in all cases; to ensure the diversity of the base clustering models, and considering that the optimal cluster number of loads in most studies is a single digit, the maximum value k_max is set to 9 in all cases. In the speu algorithm the scale parameter σ is fixed, chosen through experiments so that the algorithm performs well on most data sets: γ = 1/(2σ²) = 1.0. Each algorithm is run randomly 20 times with the set parameter combinations, and the DBI-optimal result is taken.
TABLE 1 Algorithm parameter set
Table 2 shows the DBI of the various load clustering algorithms. It can be seen from Table 2 that, on the four data sets: 1) the integrated spectral clustering considering DBI weighting is better than the speu and kmeu algorithms; compared with speu, the index of the invention improves by 0.62%, 0.78%, 2.75% and 0.43%, respectively, and compared with kmeu it improves by 30.2%, 41.3%, 27.7% and 9.67%, respectively, showing that integrated spectral clustering can improve clustering validity by relying on the ensemble-learning idea; 2) the indexes of the speu-gamma algorithm are mostly better than those of the speu algorithm, but its optimal scale parameters are inconsistent across data sets, which confirms that spectral clustering needs to re-tune the scale parameter for different load data sets; 3) the indexes of the kmeu and km-ag algorithms are inferior to those of the spectral clustering algorithms, with mean differences over all data sets of -0.293 and -0.223, respectively; 4) on the spring, summer and autumn data sets the index of the invention is better than that of the speu-gamma algorithm, while on the winter data set it is inferior; 5) the integrated spectral clustering algorithm ESC considering DBI weighting is better than the unweighted integrated spectral clustering algorithm ESC-1, and ESC-1 is inferior to speu on the summer and winter data sets because the poor DBI of the spectral clustering results based on the difference cosine distance in the base clustering models degrades the ensemble performance; this confirms that integration without considering the validity of the base clustering models harms the validity and robustness of the ensemble clustering method.
TABLE 2 clustering result DBI comparison of six classes of algorithms
(2) Integrated spectral clustering and internal evaluation index verification considering double-scale similarity
The performance of several classes of load clustering algorithms is compared with that of the integrated spectral clustering method considering double-scale similarity on the four seasonal load data sets. The comparison algorithms are: 1) the k-means algorithm measuring similarity by Euclidean distance, abbreviated kmeu; 2) the k-means algorithm measuring similarity by difference cosine distance, abbreviated kmco; 3) the spectral clustering algorithm measuring similarity by Euclidean distance with a fixed scale parameter, abbreviated speu; 4) the spectral clustering algorithm measuring similarity by difference cosine distance with a fixed scale parameter, abbreviated spco; 5) the spectral clustering algorithm measuring similarity by the composite distance with an optimized scale parameter, abbreviated spec-gamma; 6) the two-layer clustering algorithm, abbreviated km-ag.
The algorithm parameters are shown in Table 3. Each algorithm is run randomly 20 times with the set parameter combinations, and the MDBI-optimal result is taken.
TABLE 3 Algorithm parameter set
Table 4 shows the MDBI of each class of load clustering algorithm. It can be seen from Table 4 that, on the four data sets: 1) the integrated spectral clustering algorithm ESC considering MDBI weighting is better than the other algorithms; its MDBI improves by 0.45%, 18.68%, 4.42% and 0.43% compared with the spec-gamma algorithm, and by 0.23%, 1.84%, 9.32% and 2.33% compared with the speu algorithm, respectively, showing that when double-scale load similarity is considered, the validity of integrated spectral clustering is better than that of a single spectral clustering algorithm and its robustness is superior; 2) the spec-gamma algorithm outperforms the speu algorithm in MDBI only in autumn and winter, confirming the robustness deficiency of a single spectral clustering algorithm; 3) the optimal scale parameters of the spec-gamma algorithm are inconsistent over the four data sets, again verifying that the scale parameter of spectral clustering must be re-tuned for different load data sets; 4) the classic k-means algorithms and the two-layer clustering algorithm km-ag are inferior to the spectral clustering algorithms.
TABLE 4 seven-class Algorithm clustering results MDBI comparison
Fig. 2 shows the load curve integrated spectral clustering result considering double-scale similarity for the autumn load data set. It can be seen that the integrated spectral clustering method classifies the autumn loads into three clusters, whose typical load shapes can be summarized as single-peak, peak-avoiding type I and peak-avoiding type II. The first type of load climbs in the morning, stays relatively stable during the day, and falls in the evening and early morning; the second type mainly shows a rapid drop in the early morning and is relatively stable in other periods; the third type falls rapidly in the early morning and rises rapidly in the evening. The three types of loads differ considerably in both distance and shape change, so the load curve integrated spectral clustering result considering double-scale similarity is reasonable and effective.
(3) Integrated spectral clustering and external evaluation index verification considering double-scale similarity
Two new example data sets are constructed as follows: 1) data set D1, with k_1 = 6 load clusters, each containing between 5 and 30 curves (the cluster sizes differ), for a total of 105 load curves; 2) data set D2, in which each of the k_2 load clusters contains about 20 curves, for a total of 160 load curves. The true classification labels of the two data sets are shown in Fig. 3 and Fig. 4.
The performance of several classes of load clustering algorithms is compared with that of the integrated spectral clustering method considering double-scale similarity on the new data sets D1 and D2. The comparison algorithms are: 1) the k-means algorithm measuring similarity by Euclidean distance, abbreviated kmeu; 2) the k-means algorithm measuring similarity by difference cosine distance, abbreviated kmco; 3) the spectral clustering algorithm measuring similarity by Euclidean distance with a fixed scale parameter, abbreviated speu; 4) the spectral clustering algorithm measuring similarity by difference cosine distance with a fixed scale parameter, abbreviated spco; 5) the spectral clustering algorithm measuring similarity by the composite distance with an optimized scale parameter, abbreviated spec-gamma; 6) the two-layer clustering algorithm, abbreviated km-ag.
The specific algorithm parameter settings are shown in Table 5. Each algorithm is run randomly 20 times with the set parameter combinations, and the ARI-optimal result is taken.
TABLE 5 Algorithm parameter set
Table 6 gives the ARI of each class of load clustering algorithm. It can be seen from Table 6 that, on the two data sets: 1) the ARI of the integrated spectral clustering method considering double-scale similarity is better than or equal to that of the spec-gamma algorithm, with an improvement of 1.52-24.7% on data set D2, which shows that when double-scale load similarity is considered, the ability of integrated spectral clustering to distinguish load shape characteristics is better than that of a single spectral clustering algorithm and its robustness is superior; 2) the spectral clustering algorithms speu and spco, which measure similarity by a single Euclidean distance or difference cosine distance, show large ARI fluctuations over the two data sets, confirming that a single distance measure is deficient in capturing load shape characteristics; 3) the two-layer clustering algorithm km-ag performs well on data set D1 but is inferior to the ESC, speu and spec-gamma algorithms on data set D2, and the ARI of the classic k-means algorithms is inferior to that of the ESC method.
TABLE 6 ARI comparison of clustering results of seven classes of algorithms
The above description of the embodiments of the invention, given with reference to the accompanying drawings, is not intended to limit the scope of the invention. All equivalent models or equivalent algorithm flows obtained from the contents of this specification and the drawings, whether applied directly or indirectly in other related technical fields, fall within the scope of protection of the invention.

Claims (4)

1. A load curve integrated spectral clustering method considering double-scale similarity, characterized by comprising the following steps: firstly, calculating the similarity of load morphological change through the cosine distance of the load difference vectors, and constructing a double-scale similarity measure for measuring load similarity; then, taking spectral clustering as the algorithm for generating base clustering models, and constructing differentiated base clustering models by selecting different similarity measures, setting different cluster numbers and running randomly, thereby ensuring the diversity of the base clustering models; finally, taking the weighted consistency matrix and spectral clustering as the cluster-ensemble strategy, taking the Davies-Bouldin index DBI or the new index MDBI as the clustering evaluation index in the ensemble process, taking the reciprocal of the DBI or MDBI as the basis for adaptive weight setting to calculate the weighted consistency matrix, realizing the final ensemble partition by spectral clustering, and realizing the improvement of clustering performance and the effective combination of the two measures through ensemble clustering;
selecting a Davison baud index DBI and adjusting a landed index ARI as internal and external evaluation indexes of a clustering effect of the method, and considering that Euclidean distances are adopted in a classic DBI formula to measure distances of different data samples, the result effectiveness of the clustering method adopting other similarity measurement modes cannot be accurately evaluated, so that the comprehensive distance is applied to distance calculation of the DBI to construct a new index MDBI, namely:
Figure FDA0003212852930000011
in the formula, MDBI is a new index used for evaluating the effectiveness of the load clustering result considering the double-scale similarity; dsiAnd dsjRespectively the average integrated distance from the sample in the ith class to the class center of the sample and the average integrated distance from the sample in the jth class to the class center of the sample; ds (C)i,Cj) Representing the combined distance of class centers of the ith and jth classes;
the load curve comprehensive distance based on the double-scale similarity is constructed by combining the difference cosine distance and the Euclidean distance, the load distance and the similarity degree of the morphological change are considered, and the comprehensive distance can be obtained by a linear function:
ds_ii′ = a_e·de_ii′ + a_c·dc_ii′·r

where ds_ii′ is the composite distance of the i-th and i′-th load curves; de_ii′ is their Euclidean distance; dc_ii′ is their difference cosine distance; a_e and a_c are the weight coefficients of the Euclidean distance and the difference cosine distance in the composite distance, and, since both similarities are considered effective measures, a_e and a_c are both set to 0.5; because the value ranges of the Euclidean distance and the difference cosine distance differ, the difference cosine distance is amplified by a factor r, where r is a proportionality coefficient calculated as:

r = de_max / dc_max

where de_max and dc_max are, respectively, the maximum Euclidean distance and the maximum difference cosine distance over all load curves in the data set;
the method for integrating the base clustering model by adopting a weighted consistency matrix method comprises the following steps:
and (3) combining the cluster evaluation indexes of different base cluster models, and taking the cluster effectiveness into consideration in the weighted consistency matrix calculation process to perform self-adaptive weight setting: when only the distance difference of the curves is considered and the clustering performance is optimized through integrated clustering, the DBI can be adopted to calculate the weight of the base clustering model; when the distance of the load curve and the form change difference are comprehensively considered, the MDBI can be adopted to calculate the base clustering weight; since the smaller the DBI and the MDBI are, the higher the effectiveness of the clustering result is represented, the weight value of the base clustering model is the reciprocal of the DBI or the MDBI of the corresponding base clustering model, and the weighting consistency matrix is defined as follows:
con_ii′ = Σ_{b=1,…,B} w_b · I{ L_b(i) = L_b(i′) }

w_b = (1/in_b) / Σ_{b′=1,…,B} (1/in_b′)

where w_b is the weight of the b-th base clustering model in the weighted consistency matrix; B is the number of base clustering models; in_b is the clustering evaluation index of the b-th base clustering model, which can be the DBI or the MDBI; I{·} is the indicator function, equal to 1 when the expression in braces holds and 0 otherwise; L_b(i) and L_b(i′) are, respectively, the cluster labels of the i-th sample and of the i′-th sample in the b-th base clustering model; the second equation rescales the base-model weights so that they sum to 1, which keeps the elements of the weighted consistency matrix in the range [0, 1];
The weighted consistency matrix can be regarded as a similarity matrix reflecting the sample similarity, the similarity matrix is processed by adopting spectral clustering, and the spectral clustering in the integration process also adopts a graph cutting mode of NCut and selects k-means to cluster the feature matrix.
2. The method for load curve ensemble spectral clustering considering dual-scale similarity according to claim 1, wherein, in order to avoid the influence of the load curve amplitude difference on the load morphology similarity calculation result, the load curve is subjected to maximum normalization before the method is run.
3. The method for clustering load curves with consideration of double-scale similarity according to claim 2, wherein the normalized load curve data is subjected to first-order difference operation, and then the cosine distance of the first-order difference vector of the load, i.e. the difference cosine distance, is calculated to reflect the consistency of the morphological changes of the two load curves.
4. The load curve integrated spectral clustering method considering double-scale similarity according to claim 1, wherein the differentiated base clustering models are constructed as follows: spectral clustering is selected as the base clustering method with its scale parameter fixed, and the diversity of the base clustering models is ensured in three ways: 1) the similarity measure is either the Euclidean distance or the difference cosine distance; 2) different cluster numbers are set, taking every integer in the range [k_min, k_max]; 3) each combination of similarity measure and cluster number is run randomly several times, p times in total;
for the undirected-graph cut in spectral clustering, the NCut graph-cut method is selected to process the undirected weighted graph obtained from the similarity matrix, and the feature matrix obtained after dimensionality reduction in the graph-cut process is clustered by k-means.
CN202010699981.9A 2020-07-20 2020-07-20 Load curve integrated spectrum clustering method considering double-scale similarity Active CN111723876B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010699981.9A CN111723876B (en) 2020-07-20 2020-07-20 Load curve integrated spectrum clustering method considering double-scale similarity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010699981.9A CN111723876B (en) 2020-07-20 2020-07-20 Load curve integrated spectrum clustering method considering double-scale similarity

Publications (2)

Publication Number Publication Date
CN111723876A CN111723876A (en) 2020-09-29
CN111723876B (en) 2021-09-28

Family

ID=72572914

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010699981.9A Active CN111723876B (en) 2020-07-20 2020-07-20 Load curve integrated spectrum clustering method considering double-scale similarity

Country Status (1)

Country Link
CN (1) CN111723876B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112819299A (en) * 2021-01-21 2021-05-18 上海电力大学 Differential K-means load clustering method based on center optimization
CN112837188A (en) * 2021-03-10 2021-05-25 科豆(福州)教育科技有限公司 Research and travel intelligent planning method based on transfer learning and clustering algorithm
CN113489008A (en) * 2021-09-07 2021-10-08 国网江西省电力有限公司电力科学研究院 Multi-type energy supply and utilization system equivalence method based on real-time dynamic correction
CN117131397A (en) * 2023-09-04 2023-11-28 北京航空航天大学 Load spectrum clustering method and system based on DTW distance
CN117236803B (en) * 2023-11-16 2024-01-23 中铁二十二局集团电气化工程有限公司 Traction substation grading and evaluating method, system and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107657266A (en) * 2017-08-03 2018-02-02 华北电力大学(保定) A kind of load curve clustering method based on improvement spectrum multiple manifold cluster
CN108199404A (en) * 2017-12-22 2018-06-22 国网安徽省电力有限公司电力科学研究院 The spectral clustering assemblage classification method of high permeability distributed energy resource system
CN108805213A (en) * 2018-06-15 2018-11-13 山东大学 The electric load curve bilayer Spectral Clustering of meter and Wavelet Entropy dimensionality reduction

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9639642B2 (en) * 2013-10-09 2017-05-02 Fujitsu Limited Time series forecasting ensemble

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107657266A (en) * 2017-08-03 2018-02-02 华北电力大学(保定) A kind of load curve clustering method based on improvement spectrum multiple manifold cluster
CN108199404A (en) * 2017-12-22 2018-06-22 国网安徽省电力有限公司电力科学研究院 The spectral clustering assemblage classification method of high permeability distributed energy resource system
CN108805213A (en) * 2018-06-15 2018-11-13 山东大学 The electric load curve bilayer Spectral Clustering of meter and Wavelet Entropy dimensionality reduction

Also Published As

Publication number Publication date
CN111723876A (en) 2020-09-29

Similar Documents

Publication Publication Date Title
CN111723876B (en) Load curve integrated spectrum clustering method considering double-scale similarity
CN108805213B (en) Power load curve double-layer spectral clustering method considering wavelet entropy dimensionality reduction
CN110738232A (en) grid voltage out-of-limit cause diagnosis method based on data mining technology
CN110364264A (en) Medical data collection feature dimension reduction method based on sub-space learning
Liu et al. An unsupervised feature selection algorithm: Laplacian score combined with distance-based entropy measure
CN105046323A (en) Regularization-based RBF network multi-label classification method
Xueli et al. An improved KNN algorithm based on kernel methods and attribute reduction
CN113159160B (en) Semi-supervised node classification method based on node attention
CN110929761A (en) Balance method for collecting samples in situation awareness framework of intelligent system security system
CN104573726B (en) Facial image recognition method based on the quartering and each ingredient reconstructed error optimum combination
CN113780343A (en) Bilateral slope DTW distance load spectrum clustering method based on LTTB dimension reduction
CN117726939A (en) Hyperspectral image classification method based on multi-feature fusion
Mao et al. Naive Bayesian algorithm classification model with local attribute weighted based on KNN
Du et al. An Improved Algorithm Based on Fast Search and Find of Density Peak Clustering for High‐Dimensional Data
CN103761433A (en) Network service resource classifying method
CN114358207A (en) Improved k-means abnormal load detection method and system
Qin Software reliability prediction model based on PSO and SVM
Wang et al. Analysis of user’s power consumption behavior based on k-means
CN113159132A (en) Hypertension grading method based on multi-model fusion
Ding et al. Time-varying Gaussian Markov random fields learning for multivariate time series clustering
Feng Analysis on algorithm and application of cluster in data mining
CN113723835B (en) Water consumption evaluation method and terminal equipment for thermal power plant
CN103226710B (en) Based on the method for classifying modes differentiating linear expression
Wang et al. Research on the Urban Construction Status of Prefecture-Level Cities in Heilongjiang Province Based on SPSS Analysis
CN116452910B (en) scRNA-seq data characteristic representation and cell type identification method based on graph neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant