CN117609818A

CN117609818A - Power grid association relation discovery method based on clustering and information entropy

Info

Publication number: CN117609818A
Application number: CN202311583856.1A
Authority: CN
Inventors: 王宏志; 郑胜文; 刘怀远; 陈兴雷; 文晶; 李文臣; 崔勇; 顾军
Original assignee: Harbin Institute of Technology; China Electric Power Research Institute Co Ltd CEPRI; State Grid Shanghai Electric Power Co Ltd
Current assignee: Harbin Institute of Technology; China Electric Power Research Institute Co Ltd CEPRI; State Grid Shanghai Electric Power Co Ltd
Priority date: 2023-11-24
Filing date: 2023-11-24
Publication date: 2024-02-27

Abstract

A power grid association relation discovery method based on clustering and information entropy addresses the difficulty of analyzing association relations in power grid operation modes, and belongs to the field of power grid analysis. The method comprises: performing a preliminary selection of key features from the original power grid system data to obtain several groups of primary screening features; clustering within each feature group using an improved DBSCAN clustering algorithm, the clustering result being the secondary screening features; calculating the rank correlation coefficient between each secondary screening feature and the section power transmission capacity as the differentiation measure of that feature; calculating the cosine similarity d from these measures; clustering the secondary screening features with a clustering algorithm to obtain the features corresponding to K power grid operation modes; constructing a decision tree from them; adding random disturbance to each feature based on the decision tree and analyzing how disturbing different features affects the classification; and finally obtaining a coarse-grained association relation between each feature and the section power transmission capacity.

Description

Power grid association relation discovery method based on clustering and information entropy

Technical Field

The invention relates to a power grid association relation discovery method based on clustering and information entropy, and belongs to the field of power grid analysis.

Background

Association relation discovery methods are widely applied in practical settings such as industry, medicine and business. These settings generate large amounts of data in which much important information is hidden; association relation discovery can mine the interactions among the data and thereby support professional analysis in the relevant setting. Association relation discovery can be implemented with techniques such as clustering, classification and association learning, and different methods can be chosen to suit the application. For example, Vougas K. et al. built a drug response database and mined the association between cancers and drug responses using association rule mining, which offers guidance for clinical drug treatment; Gao H. et al. mined hidden association relations in the industrial Internet of Things through collaborative learning, supporting scenarios such as intelligent manufacturing and video surveillance; Kara M. E. et al. used a clustering algorithm to find the association between risk behaviors and suppliers, helping suppliers reduce related risks and helping companies eliminate risky suppliers. A power grid system, with its large volume of data, therefore offers many application scenarios for association relation discovery.

The formulation of power grid operation modes relies heavily on power grid simulation analysis and calculation, using simulation software such as PSASP and PSD-BPA, while the processing, analysis and judgment of the simulation results are at present done mainly by hand. Researchers extract power grid system data according to expert knowledge, model the system according to a typical operation mode in order to analyze it, search for the key features that affect the grid, and then adjust those key features to formulate a new operation mode. Both adjusting the section power transmission capacity and finding the operating boundary of the system require knowing the section power transmission capacity of the grid. This traditional analysis method has several problems: it needs researchers with expert experience, so the demands on personnel are high and the consumption of human resources is heavy, and manual adjustment is prone to errors and omissions. There is now a growing need to combine artificial intelligence with power grid security and stability analysis. However, the widely studied AI-based approaches to power grid simulation analysis all face one unavoidable problem: discovering the association relation between the power grid operation mode and the section power transmission capacity. Only once this association relation is found can the grid be analyzed and adjusted effectively, so as to increase the transmission capacity of the grid section, improve the security and stability of power transmission, and raise transmission efficiency.

Therefore, discovering and analyzing the association relation between the power grid operation mode and the section power transmission capacity can effectively reduce the workload of calculation staff, improve analysis precision, reduce decision errors, meet the requirements of operation mode calculation under new conditions, and substantially change the existing way of working, which depends heavily on manual labor.

Disclosure of Invention

Aiming at the problem that the association relation of the power grid operation mode is not easy to analyze, the invention provides a power grid association relation discovery method based on clustering and information entropy.

The invention discloses a power grid association relation discovery method based on clustering and information entropy, which comprises the following steps:

S1, performing a preliminary selection of key features from the original power grid system data to obtain multiple groups of power grid operation mode features, each group being of one category, the power grid operation mode features being the primary screening features;

S2, performing a normal distribution test on the primary screening features of each group to determine whether they are normally distributed; for the features that are not normally distributed, calculating the rank correlation coefficients between features within the corresponding feature group, taking the reciprocal of the rank correlation coefficient as the sample distance of the DBSCAN clustering algorithm, and clustering the primary screening features within each feature group using the DBSCAN clustering algorithm, the clustering result being the secondary screening features;

S3, calculating the rank correlation coefficient between each secondary screening feature and the section power transmission capacity as the differentiation measure of that feature; calculating the cosine similarity d from the differentiation measures; clustering the secondary screening features using the K-means++ clustering algorithm to obtain the features corresponding to K power grid operation modes; constructing a decision tree by splitting the features of the K power grid operation modes according to information entropy gain; adding random disturbance to each feature based on the decision tree and analyzing how disturbing different features affects the classification; and finally obtaining a coarse-grained association relation between each feature and the section power transmission capacity.
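The disturbance analysis at the end of S3 can be illustrated with a minimal sketch. It assumes a classifier is already available as a plain function `classify(row)`; that function, the Gaussian noise model, and the `scale` and `trials` parameters are illustrative assumptions and are not specified in the source. For each feature the sketch reports the fraction of perturbed samples whose predicted class changes, which can be read as a coarse indicator of how strongly that feature influences the classification.

```python
import random


def perturbation_sensitivity(classify, rows, scale=0.1, trials=20, seed=0):
    """Return, per feature, the fraction of perturbed rows whose class changes.

    classify: hypothetical function mapping a feature vector to a class label.
    rows:     list of feature vectors (lists of floats).
    scale:    assumed relative size of the Gaussian disturbance.
    """
    rng = random.Random(seed)
    baseline = [classify(row) for row in rows]
    n_features = len(rows[0])
    changed = [0] * n_features
    for f in range(n_features):
        for _ in range(trials):
            for row, label in zip(rows, baseline):
                noisy = list(row)
                # Disturb only feature f; all other features stay fixed.
                noisy[f] += rng.gauss(0.0, scale * (abs(row[f]) + 1.0))
                if classify(noisy) != label:
                    changed[f] += 1
    total = trials * len(rows)
    return [count / total for count in changed]
```

A feature that the classifier never uses yields a sensitivity of exactly 0, while a feature close to a split threshold yields a larger value.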

In S1, eight groups of power grid operation mode features are obtained, the categories being load level, direct current power, generator startup, generator power, spinning reserve, line switching, bus voltage, and secondary branch power near the section.

Preferably, in S2, the K-S test is used to perform the normal distribution test on the primary screening features of each group to determine whether they are normally distributed.

Preferably, S2 further comprises: for the features that are not normally distributed, calculating the variance of each feature, directly removing the features whose variance is 0, and calculating the rank correlation coefficients between features within the corresponding feature group after this variance screening.

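Preferably, in S2, the sample distance $\mathrm{dist}(x_i, x_j)$ is:

$$\mathrm{dist}(x_i, x_j) = \frac{1}{I(x_i, x_j)}$$

where $I(x_i, x_j)$ denotes the rank correlation coefficient between sample $x_i$ and sample $x_j$, both non-normally distributed, in the feature group after variance screening;

the sample distance matrix $\mathrm{dist}(X)$ is constructed from the sample distances $\mathrm{dist}(x_i, x_j)$:

$$\mathrm{dist}(X) = \begin{bmatrix} \mathrm{dist}(x_1, x_1) & \cdots & \mathrm{dist}(x_1, x_n) \\ \vdots & \ddots & \vdots \\ \mathrm{dist}(x_n, x_1) & \cdots & \mathrm{dist}(x_n, x_n) \end{bmatrix}$$

$x_1, \dots, x_n$ denote the n sample points within a feature group, $i, j = 1, 2, \dots, n$;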

the method for clustering within each feature group using the DBSCAN clustering algorithm comprises:

S21, determining eps and MinPts from the sample distance matrix $\mathrm{dist}(X)$ by combining the k-nearest-neighbor method with the average silhouette coefficient;

eps denotes the maximum radius of the density clustering circle, and MinPts denotes the minimum number of samples contained in the circle;

S22, randomly selecting a sample point satisfying $|N_{eps}(x_i)| \ge \mathrm{MinPts}$ as a core point, and creating a new category or expanding an existing category from that core point; $N_{eps}(x_i)$ denotes the eps-neighborhood of sample point $x_i$;

S23, selecting density-reachable points according to $\mathrm{dist}(X)$, eps and MinPts to expand the category;

S24, repeating S22 and S23; if a sample point $x_i$ that is found satisfies $|N_{eps}(x_i)| < \mathrm{MinPts}$, the sample point $x_i$ is treated as a noise point; if the sample point $x_i$ is density-reachable from some core point, the sample point $x_i$ is assigned to the category of that core point.
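Steps S22 to S24 can be sketched as a DBSCAN that works directly on a precomputed distance matrix. This is a minimal illustration rather than the patented implementation: it visits candidate core points in index order instead of at random, and it takes eps and MinPts as given instead of deriving them as in S21. The function name `dbscan_precomputed` is hypothetical.

```python
def dbscan_precomputed(dist, eps, min_pts):
    """Label each sample with a cluster id (0, 1, ...) or -1 for noise.

    dist is a full n x n distance matrix given as a list of lists.
    """
    n = len(dist)
    UNVISITED, NOISE = -2, -1
    labels = [UNVISITED] * n

    def neighbors(i):
        # eps-neighborhood N_eps(x_i); contains i itself when dist[i][i] <= eps.
        return [j for j in range(n) if dist[i][j] <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] != UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1  # x_i is a core point: create a new category (S22)
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:  # expand the category with density-reachable points (S23)
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # reachable from a core point (S24)
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # x_j is also a core point
                queue.extend(nbrs)
    return labels
```

Points that still carry the label -1 at the end are the noise points, which the method keeps alongside the cluster representatives as secondary screening features.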

Preferably, in S3, the cosine similarity d is:

where the two features $x = (x_1, x_2, \dots, x_n)$ and $y = (y_1, y_2, \dots, y_n)$ are both n-dimensional data points, and the feature differentiation measure is $r = (r_1, r_2, \dots, r_n)$.
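The exact expression for d is not reproduced in this text, so the following is only one plausible reading: a cosine similarity in which each dimension is weighted by the magnitude of its differentiation measure. Both the weighting scheme and the use of the absolute value of each $r_i$ are assumptions made for illustration.

```python
import math


def weighted_cosine(x, y, r):
    """Cosine similarity of x and y with per-dimension weights |r_i|.

    Assumed form: sum(w*x*y) / (sqrt(sum(w*x*x)) * sqrt(sum(w*y*y))).
    """
    w = [abs(ri) for ri in r]  # assumption: weights must be non-negative
    num = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    norm_x = math.sqrt(sum(wi * xi * xi for wi, xi in zip(w, x)))
    norm_y = math.sqrt(sum(wi * yi * yi for wi, yi in zip(w, y)))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # similarity is undefined for a zero vector; return 0 by convention
    return num / (norm_x * norm_y)
```

Under this form, a dimension with a differentiation measure near zero contributes almost nothing to the similarity, which matches the stated aim of weighting features differently.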

Preferably, in S3, the secondary screening features are clustered using the K-means++ clustering algorithm to obtain a classification label for each feature:

the initial K cluster centers are selected as follows:

one sample point in the secondary screening features is selected at random and assigned to the first cluster category as its center;

the distance between each remaining sample point and the first sample point is calculated to obtain the differentiation measure, and the sample point with the largest differentiation measure is assigned to the second cluster category as its center;

the center point of the first and second sample points is calculated, the distance between this center point and each remaining sample point is calculated, and the remaining sample point with the largest distance is assigned to the third cluster category as its center.
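The center selection above can be sketched as follows. The text describes the first three centers; continuing the same rule (take the remaining point farthest from the centroid of the centers chosen so far) for K greater than 3 is an assumption, as is the default Euclidean distance, which stands in for the differentiation-weighted measure. The function names are hypothetical.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def pick_initial_centers(points, k, distance=euclidean, seed=0):
    """Return the indices of k initial cluster centers (requires k <= len(points))."""
    rng = random.Random(seed)
    chosen = [rng.randrange(len(points))]  # first center: a random sample point
    dim = len(points[0])
    while len(chosen) < k:
        # Centroid of the centers chosen so far; with one center it is that point.
        centroid = [sum(points[i][d] for i in chosen) / len(chosen) for d in range(dim)]
        remaining = [i for i in range(len(points)) if i not in chosen]
        # Next center: the remaining point farthest from the centroid.
        farthest = max(remaining, key=lambda i: distance(points[i], centroid))
        chosen.append(farthest)
    return chosen
```

Note that this deterministic farthest-point rule differs from the standard K-means++ seeding, which samples each new center with probability proportional to its squared distance from the nearest existing center.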

Preferably, the method for constructing the decision tree by splitting the features of the K power grid operation modes according to information entropy gain comprises:

splitting the features of the K power grid operation modes by bisection, calculating the information entropy gain at each candidate split threshold, and determining the current optimal split feature and optimal split point from the information entropy gain values to form an initial decision tree;

applying a maximum depth limit, pruning the nodes that exceed the limit, and determining the final decision tree.
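The split selection can be illustrated by a minimal sketch that scans the midpoints between consecutive distinct values of each feature and keeps the threshold with the highest information entropy gain. It covers a single node only; recursion and the maximum depth limit are omitted. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, gain) of the best binary split of one feature."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    base = entropy(labels)
    best = (None, 0.0)
    for k in range(1, len(pairs)):
        lo, hi = pairs[k - 1][0], pairs[k][0]
        if lo == hi:
            continue  # no threshold separates equal values
        left = [lab for _, lab in pairs[:k]]
        right = [lab for _, lab in pairs[k:]]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - weighted
        if gain > best[1]:
            best = ((lo + hi) / 2, gain)
    return best


def best_feature(rows, labels):
    """Return (feature_index, threshold, gain) with the highest gain over all features."""
    best = (None, None, 0.0)
    for f in range(len(rows[0])):
        threshold, gain = best_split([row[f] for row in rows], labels)
        if threshold is not None and gain > best[2]:
            best = (f, threshold, gain)
    return best
```

Applying `best_feature` recursively to the two resulting subsets, and stopping at a fixed depth, gives a tree of the kind described above.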

The beneficial effects of the invention are as follows: the method defines the power grid operation mode comprehensively and effectively, which facilitates research on tasks related to the operation mode; it also analyzes the association relations of the operation mode effectively, applying different weights to different features so that the classification result is more accurate and the association relations of multiple operation modes can be discovered.

Drawings

FIG. 1 is a schematic diagram of the IEEE 39-node power grid system.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other.

The invention is further described below with reference to the drawings and specific examples, which are not intended to be limiting.

In the power field, the power grid operation mode refers to the aggregate of all the characteristics of all the components of a power grid at a given moment. For a specific power grid problem such a definition is overly complex and redundant, so to study grid problems with computer techniques, a definition of the power grid operation mode and a feature selection suited to the computing field must be provided. Feature selection depends on the task at hand, so the feature selection for the power grid operation mode must likewise depend on the relevant power grid task. For power flow adjustment tasks, or tasks that calculate and adjust the section power transmission capacity, the key generator sets are the main objects of consideration; some section-related tasks also consider the load distribution of the grid; and studies that extract typical scenarios of grid operation modes add AC line power on top of generators and loads.

However, the feature selection in those tasks considers too few dimensions and the selected features are not targeted enough, which affects the execution of subsequent tasks. Feature selection needs to be carried out comprehensively and with the specific problem in mind. In this embodiment, therefore, expert knowledge is first used to analyze the key components in the power grid system and to select key features comprehensively, giving a preliminary set of power grid operation mode features. Because these features are high-dimensional, periodic, and include many similar components, the rank correlation coefficient is introduced into the DBSCAN clustering algorithm as an improvement, achieving a secondary screening of the power grid operation mode features. Together these steps form the proposed power grid operation mode feature selection method.

The power grid association relation discovery method based on clustering and information entropy of this embodiment comprises the following steps:

Step 1, performing a preliminary selection of key features from the original power grid system data to obtain multiple groups of power grid operation mode features, each group being of one category, the power grid operation mode features being the primary screening features;

Step 1 can use preliminary feature selection based on expert knowledge: a real grid contains a large number of components such as generators, buses, AC lines and loads, and a classical grid system is shown in FIG. 1;

A large power grid in real use has hundreds or thousands of generators and even more of the other components, so the number of parameters is very large, and each component itself also has complex parameters. Taking the generators in the power grid as an example, their parameters are shown in Table 1:

Table 1 List of generator parameters

These are only the parameters of a generator. Among them, the generator's given active power, given reactive power, given voltage and so on all deserve attention, as do the parameters of other components, such as AC line voltage, the given active and reactive power of loads, the operating power of DC lines, bus voltage, and transformer control information. Without feature screening the features are highly redundant, which greatly increases the computation cost; because of this redundancy, the association relations finally computed may also be poorly differentiated and of little help in adjusting the section power transmission capacity. Finding the key features, or a way to combine features, is therefore very important.

In fact, the whole grid can be abstracted into three parts: power is generated on the power generation side, flows through the power flow lines, and is finally consumed at the loads. All three affect the section power transmission capacity, and each is analyzed below.

First, the power generation side. The power generation side is where the power of the whole grid system originates, and generators strongly influence the data of every component in the system, so the active power, reactive power and on/off status of the generators have a large effect on the power grid system. As grid systems keep growing, generator sets operate in more and more different modes and their influence on the operation of the whole grid keeps increasing. With the formation of large interconnected grids, the role of any single generator is weakened and the grid system is divided into regions, so attention shifts to the characteristics of the generator sets within a region. At the same time, the scale of new energy generation keeps expanding and generators of different sources and types are connected to the grid system. Focusing on single generators would make the parameters more complex and the power flow changes during adjustment harder to follow, so that adjusting the grid configuration becomes difficult and the security of the grid system is seriously challenged. The power generation side is therefore processed by region: generator set data are extracted per region. Because generator active power, generator startup and generator spinning reserve have the greatest influence on the power grid operation mode, these three characteristics are selected and extracted by region, giving three groups of features.

Next, the power flow lines. To reach the loads from the power generation side, power must pass through lines such as buses, AC lines and DC lines, and the section itself consists of a number of AC lines. The economy and reliability of the whole grid depend on the power flow conditions in these lines, so extracting features from the intermediate lines is also indispensable. First the AC lines that make up the selected section are determined; then a breadth-first search is carried out on the constructed power grid network graph to select the buses, AC lines and other components near the section, from which feature data are extracted in a targeted manner.
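The breadth-first search around a section can be sketched as follows. The adjacency-dictionary representation of the grid graph, the bus names, and the `max_hops` cutoff are illustrative assumptions; the source does not specify how the network graph is stored or how far the search extends.

```python
from collections import deque


def buses_near_section(adjacency, section_buses, max_hops):
    """Breadth-first search outward from the buses at the ends of the section lines.

    adjacency:     dict mapping a bus name to the buses connected to it by a line.
    section_buses: buses that terminate the AC lines forming the section.
    Returns a dict mapping each bus within max_hops to its hop count.
    """
    hops = {bus: 0 for bus in section_buses}
    queue = deque(section_buses)
    while queue:
        bus = queue.popleft()
        if hops[bus] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in adjacency.get(bus, ()):
            if nxt not in hops:
                hops[nxt] = hops[bus] + 1
                queue.append(nxt)
    return hops
```

Sorting the returned buses by hop count gives a natural order from which a fixed number of nearby buses or lines could be taken.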

Finally, the load. The load is where power is consumed in the whole grid system, which exists to serve the load, so it is a critically important component for the section power transmission capacity. It is closely related to generator power and affects the line power flow throughout the grid. Like generators, loads are of many kinds, such as residential electricity and railway power supply, each with its own detailed characteristics. Focusing on individual loads would make the parameters cumbersome and model analysis hard to carry out, so the same strategy as on the power generation side is adopted: the given active power of the loads is extracted as features by region.

Through this analysis, the power grid operation mode, in a form usable by computer techniques for problems related to the section power transmission capacity, is defined as eight groups of power grid operation mode features. The categories are load level, direct current power, generator startup, generator power, spinning reserve, line switching, bus voltage, and secondary branch power near the section, as follows:

(1) Load level: following the regional extraction principle, the given active power of the loads in the power grid system is summed over the whole network and over each region. The number of features in this group depends on the power grid system and on how it is divided into regions. Taking the Northeast power grid as an example, it is divided into 76 regions, so with the whole-network total added, this feature group contains 77 features.

(2) Direct current power: because a power grid system usually contains very few DC lines, the data of all DC lines in the network can be extracted. For each DC line, the power of pole 1 and pole 2 is summed and the absolute value is taken, and each line is counted separately. The number of features in this group equals the number of valid DC lines in the power grid system under study.

(3) Generator startup: following the regional extraction principle, the number of generators in operation in the power grid system is summed over the whole network and over each region. The number of features in this group is determined in the same way as for the load level group.

(4) Generator power: following the regional extraction principle, the active power of the generators in the power grid system is summed over the whole network and over each region. The number of features in this group is determined in the same way as for the load level group.

(5) Spinning reserve: following the regional extraction principle, for each generator in the power grid system whose valid flag is 1, the given active power is subtracted from the upper limit of its active power to give its spinning reserve, which is summed over the whole network and over each region. The number of features in this group is determined in the same way as for the load level group.

(6) Line switching: the valid flags of 20 AC lines near the AC lines that make up the section under study are taken as the line switching features, indicating whether each AC line is in service.

(7) Bus voltage: the voltages of 20 buses near the AC lines that make up the section under study are taken (buses reachable within at most two edges).

(8) Secondary branch power near the section: the absolute values of the i-side power of 20 AC lines that are second-order reachable from the AC lines making up the section under study are taken, making sure that each of these AC lines is in service.
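The regional extraction principle used by groups (1) and (3) to (5) can be illustrated with a short sketch. The record layout (dictionaries with `region`, `p_set`, `p_max` and `valid` keys) is a hypothetical stand-in for the actual simulation data format, which the source does not describe.

```python
from collections import defaultdict


def regional_totals(records, value_key):
    """Sum records[value_key] over the whole network and over each region."""
    totals = defaultdict(float)
    for rec in records:
        totals["whole_network"] += rec[value_key]
        totals[rec["region"]] += rec[value_key]
    return dict(totals)


def spinning_reserve(generators):
    """Per-region spinning reserve: active power upper limit minus given active power.

    Only generators whose valid flag is 1 are counted.
    """
    online = [
        {"region": g["region"], "reserve": g["p_max"] - g["p_set"]}
        for g in generators
        if g["valid"] == 1
    ]
    return regional_totals(online, "reserve")
```

With 76 regions, `regional_totals` would return 77 entries, matching the feature count given for the load level group.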

A power transmission section is defined by how the power grid system is partitioned, and the section power transmission capacity is the capacity to transmit power across that section.

This completes the preliminary feature selection based on expert knowledge. The definition of operation mode features in power system terms has been converted into a definition that is usable with computing techniques and targeted at problems related to the section power transmission capacity. This overcomes the shortcomings of feature selection in existing grid-related tasks, selects the power grid operation mode features more comprehensively and with better focus, and facilitates subsequent task processing.

Step 2, performing a normal distribution test on the primary screening features of each group to determine whether they are normally distributed; for the features that are not normally distributed, calculating the rank correlation coefficients between features within the corresponding feature group, taking the reciprocal of the rank correlation coefficient as the sample distance of the DBSCAN clustering algorithm, and clustering the primary screening features within each feature group using the DBSCAN clustering algorithm, the clustering result being the secondary screening features:

The power grid operation mode feature data that have passed the primary screening are still high-dimensional, so a feature selection method is needed for a secondary screening of the primary screened features. Existing methods for clustering power grid operation modes use principal component analysis for dimension reduction. Principal component analysis is a feature extraction method that converts high-dimensional features into low-dimensional ones and reduces the complexity of the data well, but it does not suit the task studied in this application: features reduced by principal component analysis are hard to interpret, so an association relation found between such features and the section power transmission capacity would be meaningless for adjusting that capacity. Here the objects being clustered are the features themselves, each described by all the values it takes across samples, so the clustering is a clustering of variation trends; the resulting clusters are irregular in shape, which calls for density-based clustering. Plain DBSCAN treats all features as points of the same kind, and clustering the features group by group handles this better. As for the similarity measure, because clustering between features is a clustering of variation trends, simple distance and similarity measures are inadequate, so a correlation coefficient is used; and because normality testing shows that the power grid operation mode features do not follow a normal distribution, the rank correlation coefficient is used as the similarity measure.

For the DBSCAN method, the choice of the eps and MinPts parameters is also important. Because the rank correlation coefficients between some features are close to one another, the clustering result is sensitive to the choice of eps. The reciprocal of the rank correlation coefficient is therefore introduced into the DBSCAN method, which reduces the sensitivity to eps, makes the parameter values easier to choose, gives a better clustering result, and at the same time preserves the relative proportions of the correlation coefficients between features.

Some related studies also select power grid operation mode features by correlation coefficient, but they use the Pearson correlation coefficient without testing the features for normality. This is unreasonable, because the Pearson correlation coefficient has a limited range of application: it can be used if the sample data are normally distributed and linearly related, but for power grid operation mode data after primary feature screening, normality cannot be guaranteed, and using the Pearson correlation coefficient may give inaccurate correlations. The secondary feature selection therefore proceeds as follows. First, the primary screening features are selected by definition from the original power grid system data and expert knowledge. A normal distribution test then determines whether the primary screening features are normally distributed. If they are not, the rank correlation coefficients between features are calculated within each feature group of the same category, the reciprocal of the rank correlation coefficient is introduced into the DBSCAN clustering algorithm as the sample distance, and clustering is carried out group by group. Finally, the noise samples and the representative samples of each cluster are retained, completing the secondary feature selection.

The normality of the primary screening features is tested with the K-S (Kolmogorov-Smirnov) test. The test computes the empirical cumulative distribution function of the sample feature data and the cumulative distribution function of the reference normal distribution. The null hypothesis $H_0$ is that the tested sample does not differ significantly from a normal distribution. The difference between the two functions is computed and compared with the critical value at the chosen confidence level, which is obtained from a table. If the difference is smaller than the critical value, $H_0$ is accepted at that confidence level; otherwise $H_0$ cannot be accepted at that confidence level.

The K-S test yields a D value and a P value. The D value is the maximum distance between the distribution of the tested sample and the normal distribution: the larger the value, the greater the difference, and the smaller the value, the closer the two distributions. The P value is the p-value of the hypothesis test: if it is greater than 0.05, the tested sample distribution can be regarded as consistent with a normal distribution; otherwise this cannot be concluded.
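The D and P values can be illustrated with a self-contained sketch. It compares the sample with a normal distribution whose mean and standard deviation are estimated from the sample itself, and approximates the P value with the asymptotic Kolmogorov series; both choices are assumptions, and estimating the parameters from the data makes the P value conservative. In practice a statistics library would normally be used instead.

```python
import math


def normal_cdf(x, mean, std):
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def ks_normal(sample):
    """Return (D, P) of a one-sample K-S test against a fitted normal distribution."""
    n = len(sample)
    mean = sum(sample) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in sample) / (n - 1))
    if std == 0.0:
        raise ValueError("constant sample: normality test is undefined")
    d = 0.0
    for i, v in enumerate(sorted(sample)):
        f = normal_cdf(v, mean, std)
        # Largest gap between the empirical CDF steps and the normal CDF.
        d = max(d, (i + 1) / n - f, f - i / n)
    lam = math.sqrt(n) * d
    if lam < 0.3:
        return d, 1.0  # the series converges slowly here and P is ~1
    p = 2.0 * sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam) for k in range(1, 101)
    )
    return d, min(max(p, 0.0), 1.0)
```

A feature whose P value falls at or below 0.05 would be treated as not normally distributed and passed on to the rank correlation step.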

Data that are not normally distributed are processed in two steps, using the variance and the rank correlation coefficient. First, the variance of each feature is calculated and the features whose variance is 0 are removed directly: such features do not change across samples, so their correlation with other features cannot be calculated from the original data and there is little point in keeping them. After variance screening, correlations are calculated within each of the eight feature groups using the rank correlation coefficient. Unlike the Pearson correlation coefficient, the rank correlation coefficient does not require the data to be normally distributed, and it removes the effect of differences in magnitude between samples. Suppose there are two features A and B containing the data $\{a_1, a_2, \dots, a_n\}$ and $\{b_1, b_2, \dots, b_n\}$ respectively. The two groups of data are sorted separately, so that each data item obtains its own rank index, $\{R(a_1), R(a_2), \dots, R(a_n)\}$ and $\{R(b_1), R(b_2), \dots, R(b_n)\}$ respectively. The rank index differences of the corresponding data of features A and B are then $\{d_1, d_2, \dots, d_n\}$, where $d_i = R(a_i) - R(b_i)$. The rank correlation coefficient is given by formula (1):

$$I(A, B) = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \tag{1}$$
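The variance screening and rank correlation steps can be sketched as follows, using the standard rank-difference form of the Spearman coefficient. Giving tied values the average of their positions is an added assumption, and the rank-difference formula is exact only when there are no ties. The function names are hypothetical.

```python
import statistics


def drop_constant(features):
    """features: dict of name -> list of values. Remove features with variance 0."""
    return {name: vals for name, vals in features.items() if statistics.pvariance(vals) > 0}


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def spearman(a, b):
    """Rank correlation coefficient from the squared rank differences."""
    n = len(a)
    d_squared = sum((ra - rb) ** 2 for ra, rb in zip(ranks(a), ranks(b)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))
```

Because only the ranks enter the calculation, any monotonic relation between two features gives a coefficient of 1 or -1 regardless of scale.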

The intra-group rank correlation coefficient serves as a similarity measure between two features, and a DBSCAN clustering method that incorporates the rank correlation coefficient is used for clustering within each group. DBSCAN yields density-based clusters together with some noise points. The core samples of a cluster are representative of the other samples in that cluster, so the other samples can be removed, leaving the core sample points of each cluster and the noise points as the secondary screening features.

Assume a sample set $X = \{x_1, x_2, \ldots, x_n\}$. The DBSCAN clustering algorithm has two parameters, eps and MinPts: eps is the maximum radius of the density clustering circle, and MinPts is the minimum number of samples inside the circle. The DBSCAN clustering algorithm is improved by introducing the rank correlation coefficient. The following definitions help in understanding the DBSCAN algorithm flow and in determining the DBSCAN parameters:

Definition 1: the eps-neighborhood of a sample point $x_i$ is denoted $N_{eps}(x_i)$ and is defined as shown in formula (2):

$$N_{eps}(x_i) = \{x_j \mid \mathrm{dist}(x_i, x_j) \le eps\} \tag{2}$$

where $\mathrm{dist}(x_i, x_j)$ is the distance metric function between sample point $x_i$ and sample point $x_j$.

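Definition 2: a sample distance matrix is constructed for the sample set X, as shown in formula (3):

$$\mathrm{dist}(X) = \begin{bmatrix} \mathrm{dist}(x_1, x_1) & \cdots & \mathrm{dist}(x_1, x_n) \\ \vdots & \ddots & \vdots \\ \mathrm{dist}(x_n, x_1) & \cdots & \mathrm{dist}(x_n, x_n) \end{bmatrix} \tag{3}$$

where $\mathrm{dist}(x_i, x_j)$ is given by formula (4):

$$\mathrm{dist}(x_i, x_j) = \frac{1}{I(x_i, x_j)} \tag{4}$$

where $I(x_i, x_j)$ is the rank correlation coefficient between sample point $x_i$ and sample point $x_j$.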

Definition 3: if the number of sample points in the eps-neighborhood of sample point $x_i$ is greater than or equal to MinPts, $x_i$ is called a core point.

Definition 4: if sample point $x_i$ is a core point and sample point $x_j$ lies in $N_{eps}(x_i)$, then, for the given eps and MinPts parameters, $x_j$ is directly density-reachable from $x_i$.

Definition 5: if sample point $x_j$ is directly density-reachable from sample point $x_i$ and the number of sample points in the eps-neighborhood of $x_j$ is smaller than MinPts, $x_j$ is a boundary point.

Definition 6: if sample point $x_i$ is neither a core point nor a boundary point, it is a noise point.

Definition 7: if there is a sequence of sample points $P = \{p_1, p_2, \ldots, p_m\}$, all belonging to X, such that $p_{i+1}$ is directly density-reachable from $p_i$, with $p_1 = x_i$ and $p_m = x_j$, then, for the given eps and MinPts parameters, $x_j$ is density-reachable from $x_i$.

Definition 8: if sample point $x_i$ and sample point $x_j$ are both density-reachable from a sample point $x_k$, then $x_i$ and $x_j$ are density-connected.

Definition 9: a cluster S is a non-empty subset of X that satisfies the following conditions: (1) for any sample points $x_i$ and $x_j$, if $x_i$ is in cluster S and $x_j$ is density-reachable from $x_i$, then $x_j$ is also in cluster S; (2) for any sample points $x_i$ and $x_j$, if both are in cluster S, then $x_j$ is density-reachable from $x_i$.

The reciprocal of the rank correlation coefficient is used as the distance measure. This widens the distance gap between noise points and cluster points, reduces the sensitivity to the eps parameter so that an eps value with good clustering performance is easier to find, and preserves the proportional relationship among the rank correlation coefficients.
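As a rough illustration of the reciprocal distance, the sketch below turns a matrix of rank correlation coefficients into a sample distance matrix. Taking the absolute value, flooring very small coefficients to avoid division by zero, and setting the diagonal to 0 are assumptions added here so the example is well defined; they are not stated in the description.

```python
def reciprocal_distance_matrix(corr, floor=1e-6):
    """Build dist(X) from rank correlation coefficients.

    Assumptions (not from the source): |I| is used so distances are
    non-negative, coefficients below `floor` are clipped, and the
    distance of a sample to itself is 0.
    """
    n = len(corr)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                dist[i][j] = 1.0 / max(abs(corr[i][j]), floor)
    return dist
```

With this measure a strongly correlated pair (coefficient near 1) sits at a distance near 1, while a weakly correlated pair (coefficient 0.1) is pushed out to a distance of 10, which is the widening effect the paragraph above describes.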

The eps and MinPts parameters are determined jointly by the k-th nearest neighbor method and the average silhouette coefficient. First, using the sample distance matrix, the k-th nearest sample point of each sample is found and its distance is recorded. All sample points are then sorted by this k-th nearest neighbor distance and the curve is plotted. A rough range is read off from the distances at which the slope of the curve flattens. DBSCAN clustering is performed with distances in this range as the eps parameter, the average silhouette coefficient of each clustering result is calculated, and the optimal eps parameter is obtained. The MinPts parameter is then determined in the same way. For the sample set $X = \{x_1, x_2, \ldots, x_n\}$, the silhouette coefficient of sample point $x_i$ is given by formula (5):

$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)} \tag{5}$$

where $a_i$ is the average distance between sample point $x_i$ and the other sample points in the same cluster, and $b_i$ is the average distance between sample point $x_i$ and all sample points in its nearest class. To describe the overall clustering effect, the silhouette coefficient is calculated for every sample point and the values are averaged, giving the average silhouette coefficient shown in formula (6):

$$\bar{s} = \frac{1}{n}\sum_{i=1}^{n} s_i \tag{6}$$

The nearest class of sample point $x_i$ is found by calculating the average distance between $x_i$ and all samples of each other class and selecting the class with the lowest average distance, as shown in formula (7):

$$C_{nearest}(x_i) = \arg\min_{C_j} \frac{1}{m_j}\sum_{p \in C_j} \mathrm{dist}(x_i, p) \tag{7}$$

where $C_j$ is the j-th category other than the category containing sample point $x_i$, and $m_j$ is the number of samples contained in category $C_j$.
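The parameter search described above relies on two quantities: the sorted k-th nearest neighbor distances and the average silhouette coefficient of a clustering result. A minimal sketch of both, working directly on a precomputed distance matrix, is given below. Skipping noise points (label -1) and scoring singleton clusters as 0 are conventions assumed for this example only.

```python
def k_distance_curve(dist, k):
    """Sorted distance from every sample to its k-th nearest neighbor."""
    curve = []
    for i, row in enumerate(dist):
        others = sorted(d for j, d in enumerate(row) if j != i)
        curve.append(others[k - 1])
    return sorted(curve)

def average_silhouette(dist, labels):
    """Mean silhouette coefficient; samples labelled -1 (noise) are skipped."""
    clusters = {}
    for index, label in enumerate(labels):
        if label != -1:
            clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        return 0.0  # the coefficient needs at least two clusters
    scores = []
    for label, members in clusters.items():
        for i in members:
            same = [dist[i][j] for j in members if j != i]
            if not same:
                scores.append(0.0)  # singleton cluster (assumed convention)
                continue
            a = sum(same) / len(same)
            # b: average distance to the nearest other cluster
            b = min(sum(dist[i][j] for j in other) / len(other)
                    for other_label, other in clusters.items()
                    if other_label != label)
            scores.append(0.0 if max(a, b) == 0.0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)
```

In use, one would plot `k_distance_curve(dist, k)`, pick candidate eps values near the point where the curve flattens, and keep the candidate whose clustering gives the highest `average_silhouette`.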

Step 21: determine eps and MinPts from the sample distance matrix dist(X) by combining the k-th nearest neighbor method with the average silhouette coefficient;

Step 22: randomly select a sample point satisfying $|N_{eps}(x_i)| \ge MinPts$ as a core point, and create a new category or expand an existing category from this core point, where $N_{eps}(x_i)$ denotes the eps-neighborhood of sample point $x_i$;

Step 23: select density-reachable points according to dist(X), eps and MinPts to expand the category;

Step 24: repeat Step 22 and Step 23; if a found sample point $x_i$ satisfies $|N_{eps}(x_i)| < MinPts$, $x_i$ is treated as a noise point, and if $x_i$ is density-reachable from some core point, $x_i$ is assigned to the category of that core point.
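Steps 22 to 24 can be sketched as a small DBSCAN routine that works on a precomputed distance matrix. This is an illustrative sketch, not the patented implementation: it visits samples in index order rather than picking core points at random, counts a sample as a member of its own eps-neighborhood, and uses -1 as the noise label, all of which are assumptions made for the example.

```python
from collections import deque

NOISE = -1

def dbscan_precomputed(dist, eps, min_pts):
    """Cluster samples from a distance matrix; returns one label per sample."""
    n = len(dist)
    neighbours = [[j for j in range(n) if dist[i][j] <= eps] for i in range(n)]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = NOISE  # may be relabelled later as a boundary point
            continue
        labels[i] = cluster  # i is a core point: start a new category
        queue = deque(neighbours[i])
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # boundary point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])  # j is also core: keep expanding
        cluster += 1
    return labels
```

Passing the reciprocal rank correlation distance matrix as `dist` would give the intra-group clustering described in the steps above.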

The improved DBSCAN clustering algorithm is executed separately for each category of the primary screening power grid operation mode features to obtain the clusters of each category. One representative feature must then be selected from each cluster as the cluster representative, which realizes the secondary feature selection. The selection principle is that the average distance between the point and the other sample points of the same category is minimal and the average distance between the point and the noise points and other categories is maximal, as shown in formula (8) and formula (9):

where the count term in the formulas denotes the number of sample points in the category containing $x_i$.

Because power grid operation mode samples carry no classification labels, labels can only be assigned to the sample points through cluster analysis; the importance of each feature can then be calculated from these labels by means of information entropy and used as the association relation. At present, most clustering research on power grid operation modes focuses on problems of the operation mode itself, and there is no comparable clustering research on section-related problems, so clustering of power grid operation modes oriented to key sections needs to be studied. The rank correlation coefficient between each secondary screening feature and the section power transmission capacity is used as the differential measure and introduced into the K-means++ clustering method commonly used for power grid operation mode clustering, which improves that method and achieves targeted clustering of the power grid operation modes. In addition, the fine-grained association relations found by a deep convolutional network are introduced into the power grid operation mode clustering as a differential measure.

Information entropy was first proposed by Shannon and represents the degree of disorder of information: the more disordered the information, the larger the information entropy, and conversely the smaller. It can therefore be used to measure how much each feature improves the orderliness of the whole sample set, that is, how important each feature is. The Gini coefficient can also be used to discover feature importance; however, the distribution of the power grid operation mode feature data is relatively chaotic, and the Gini coefficient performs poorly on highly disordered data, whereas information entropy can partition a chaotic system. Information entropy is therefore adopted for association relation discovery.

This embodiment obtains a coarse-grained association relation, that is, the association relation between each feature group and the section power transmission capacity, taking the group as the unit. The reason is that the secondary screening step removes features with very small variance and features that are represented by other similar features, so the information in those features is ignored; in addition, pruning is performed to prevent overfitting when the decision tree is built from the information entropy gain, so many features are not considered. The association relation is therefore clear at the group level but less reliable for individual features. The coarse-grained association relation obtained by the method can serve as a reference for the association relation between each group and the section power transmission capacity, and can also be used to verify association relations that have already been obtained.

In the association relation discovery method based on clustering and information entropy, the rank correlation coefficient between each secondary screening feature and the section power transmission capacity is first calculated as the differential measure of the feature and introduced into a length-based cosine similarity calculation. The cosine similarity is then used as the distance measure in the K-means++ clustering algorithm to cluster the power grid operation modes. A decision tree is then constructed from the information entropy gain of the samples, random perturbation is added to each feature, and the coarse-grained association relation between each feature and the section power transmission capacity is finally obtained.

S3, calculating the rank correlation coefficient between each secondary screening feature and the section power transmission capacity as the differential measure of the feature; calculating the cosine similarity d according to the differential measure; clustering the secondary screening features with the K-means++ clustering algorithm to obtain the features corresponding to K power grid operation modes; constructing a decision tree from the information entropy gain of the features of the K power grid operation modes for classification division; adding random perturbation to each feature based on the decision tree and analyzing the influence of perturbing different features on the classification; and finally obtaining the coarse-grained association relation between each feature and the section power transmission capacity:

K-means++ has been used for typical scenario extraction of power grid operation modes. However, such clustering does not consider the subsequent specific task; it produces a general classification that offers little guidance for a specific problem, so a task-specific clustering method for power grid operation modes is needed. Because the purpose of the clustering in this application is to find the association relation between the power grid operation mode and the section power transmission capacity, the secondary screening features must be processed in a targeted manner.

In clustering, sample features differ in their importance to the clustering result. Standardization and normalization eliminate the differences between dimensions, which is unreasonable and untargeted. A differential measurement method is therefore needed to measure the degree to which each feature influences the clustering, and the rank correlation coefficient between each feature of a sample and the power transmission capacity of its key section is used as the differential measure.

For sample distance and similarity measurement, the sample points of the power grid operation mode are high-dimensional, so the Euclidean distance is no longer suitable as a distance measure. Cosine similarity works better as a similarity measure, but it cannot be applied directly in a clustering algorithm: different sample feature vectors have different lengths, and the length difference also affects the similarity. The cosine similarity is therefore improved by introducing the differential measure and the vector length ratio. Assume two sample points $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$, both n-dimensional data points, and a feature differential measure $r = (r_1, r_2, \ldots, r_n)$. If the length of the vector of sample point x is smaller than the length of the vector of sample point y, the improved cosine similarity is given by formula (10):

K-means++ improves on the original K-means algorithm by selecting the initial cluster center samples differently. K-means selects them at random, whereas the initial sample points of K-means++ are selected as follows. A sample point is first selected at random and assigned to the first cluster category as its center. To select the second sample point, the distance between each remaining point and the first point is calculated with the differential measure introduced, and the remaining point with the largest distance is assigned to the second cluster category as its center. To select the third sample point, the center point of the first two sample points is calculated, the distance between this center point and each remaining point is calculated, and the remaining sample point with the largest distance is assigned to the third cluster category as its center. Selection continues in this way until the initial K cluster centers have been selected.
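The center selection procedure described above can be sketched as follows. The distance function is left as a parameter; the weighted Euclidean distance used as the default is only a stand-in chosen for illustration (the description uses the improved cosine similarity), and taking the absolute value of the differential measure r as the weight is likewise an assumption.

```python
import random

def weighted_euclidean(x, y, r):
    """Stand-in distance: Euclidean distance weighted by |r| (assumption)."""
    return sum(abs(w) * (a - b) ** 2 for a, b, w in zip(x, y, r)) ** 0.5

def farthest_point_init(samples, k, r, distance=weighted_euclidean, seed=None):
    """Pick k initial center indices: one at random, then repeatedly the
    remaining sample farthest from the center point of those already chosen."""
    rng = random.Random(seed)
    chosen = [rng.randrange(len(samples))]
    dim = len(samples[0])
    while len(chosen) < k:
        # Center point of the samples selected so far.
        centre = [sum(samples[i][d] for i in chosen) / len(chosen)
                  for d in range(dim)]
        remaining = [i for i in range(len(samples)) if i not in chosen]
        chosen.append(max(remaining,
                          key=lambda i: distance(samples[i], centre, r)))
    return chosen
```

With a single chosen center, the center point is that sample itself, so the second pick is simply the sample farthest from the first, as in the description.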

The effect of cluster analysis is also affected by the number K of cluster categories. Specifying this value manually can lead to poor clustering, so the elbow method is used to select the optimal number K of cluster categories. The objective is calculated for different K values with the sum of squared errors (SSE) as the index, and the K value at which the plotted curve begins to level off is found; the curve resembles an elbow, and this point is the optimal K value. As K increases, the data is partitioned more finely and the degree of aggregation within each category improves. While K is smaller than the optimal value, the degree of aggregation within the categories improves quickly as K increases, so the slope between two points on the curve is large; once K exceeds the optimal value, the improvement slows as K increases and the slope between two points on the curve becomes small, which produces the "elbow" shape. The specific formula is shown as formula (11):

$$SSE = \sum_{i=1}^{K}\sum_{x \in C_i} \lVert x - \mu_i \rVert^2 \tag{11}$$

where $C_i$ is the i-th cluster category of the cluster analysis, x is a sample point in that cluster category, and $\mu_i$ is the cluster center of that cluster category.
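The SSE index and a simple way of locating the elbow can be sketched as below. The description reads the elbow off a plotted curve; the `elbow_k` helper, which picks the K with the sharpest drop in slope over consecutive K values, is a hypothetical heuristic added for illustration only.

```python
def sse(samples, labels):
    """Sum of squared errors of samples to the mean of their own cluster."""
    clusters = {}
    for x, label in zip(samples, labels):
        clusters.setdefault(label, []).append(x)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        mu = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        total += sum((m[d] - mu[d]) ** 2 for m in members for d in range(dim))
    return total

def elbow_k(sse_by_k):
    """Return the K where the SSE curve bends most (assumes consecutive K)."""
    ks = sorted(sse_by_k)
    best_k, best_bend = None, float("-inf")
    for prev, cur, nxt in zip(ks, ks[1:], ks[2:]):
        # Drop in SSE before K minus drop in SSE after K.
        bend = (sse_by_k[prev] - sse_by_k[cur]) - (sse_by_k[cur] - sse_by_k[nxt])
        if bend > best_bend:
            best_k, best_bend = cur, bend
    return best_k
```

In practice `sse_by_k` would be filled by running the clustering once per candidate K and recording `sse(samples, labels)` for each run.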

Association relation discovery and analysis based on the clustering result and information entropy:

After clustering, each power grid operation mode sample has a classification label, and these labels are specific to the section power transmission capacity. A decision tree can therefore be constructed from the information entropy gain of each feature with respect to the classification. Random perturbation is then added to the features, the influence of perturbing different features on the classification is analyzed, and the association relation between each feature and the section power transmission capacity is finally obtained.

Discovering feature importance with information entropy requires a decision tree. The power grid operation mode features are continuous data, so they are processed with the bi-partition method in order to calculate the information gain.

The bi-partition method sorts the values of a given feature over all sample points and averages each pair of adjacent feature values. With n samples, each feature yields n-1 midpoints, which serve as candidate split thresholds. The information entropy gain is calculated for each candidate threshold, and the threshold with the largest gain is selected as the optimal split point for that feature. Performing this operation on all features gives the current optimal split feature and optimal split point, and the initial decision tree is finally formed.

The information entropy gain is calculated as shown in formula (12):

$$Gain(X) = H(X) - \sum_{k} \frac{|X^k|}{|X|} H(X^k) \tag{12}$$

The information entropy gain takes the difference between the information entropy before and after a given division, which quantifies how much that division reduces the disorder of the sample set. Here X is the sample set before division, $X^k$ is the k-th sample subset after division, and H(X) is the information entropy of the sample set X, given by formula (13):

$$H(X) = -\sum_{k} p_k \log_2 p_k \tag{13}$$

where $p_k$ is the proportion of the samples in X that carry the k-th classification label.
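The bi-partition search for the best split threshold of one continuous feature can be sketched as follows: the entropy is computed from the label proportions, and each midpoint between adjacent sorted values is scored by its information entropy gain. The function names are illustrative, and skipping midpoints between equal values is an assumption added to avoid empty subsets.

```python
import math
from collections import Counter

def entropy(labels):
    """Information entropy of a list of classification labels (base 2)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, subsets):
    """Entropy before division minus the weighted entropy after division."""
    n = len(labels)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets if s)

def best_threshold(values, labels):
    """Return (threshold, gain) of the best midpoint split for one feature."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    best = (None, -1.0)
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue  # equal neighbours give no usable midpoint
        t = (v1 + v2) / 2.0
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        gain = information_gain(labels, [left, right])
        if gain > best[1]:
            best = (t, gain)
    return best
```

Running `best_threshold` on every feature and keeping the feature with the largest gain gives the split for the current node of the decision tree.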

The decision tree is then constructed starting from the root node: the split feature with the largest information entropy gain in the current sample set and its optimal split point are found, the child nodes of the decision tree are built iteratively, and the initial decision tree is completed.

Pruning is performed by limiting the maximum depth: nodes beyond the limit are pruned, which is the most direct and effective means. The scores on the training set and the test set are calculated for different maximum depths, the maximum depth with the best performance is found, and the final decision tree is thereby determined.

The formula for obtaining the importance degree of each feature is shown as formula (14):

where $e_i$ denotes the i-th feature, the perturbed accuracy term is the accuracy of the model after the i-th feature of the test set has been randomized, and $acc(X_{test})$ is the accuracy of the model on the original test set. The smaller the value of formula (14), the smaller the influence of the feature on classification. The importance of each feature can thus be obtained, and with it the coarse-grained association relation.
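The perturbation analysis can be illustrated with the sketch below. Formula (14) is not reproduced here, so the score used in the example, the accuracy on the original test set minus the accuracy after one feature is randomized, is an assumed form that is merely consistent with the statement that a smaller value means a smaller influence. Randomizing by shuffling the feature column, and the `predict` callable standing in for the trained decision tree, are also assumptions.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def perturbation_importance(predict, rows, labels, seed=0):
    """Accuracy drop per feature after shuffling that feature's column.

    Assumed score: acc(original) - acc(perturbed); larger means the
    feature matters more to the classification.
    """
    rng = random.Random(seed)
    base = accuracy(predict, rows, labels)
    scores = []
    for i in range(len(rows[0])):
        column = [r[i] for r in rows]
        rng.shuffle(column)  # randomize feature i across the test set
        perturbed = [list(r[:i]) + [v] + list(r[i + 1:])
                     for r, v in zip(rows, column)]
        scores.append(base - accuracy(predict, perturbed, labels))
    return scores
```

Summing the scores of the features that belong to one group would give a group-level figure in the spirit of the coarse-grained association relation.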

Although the invention herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present invention. It is therefore to be understood that numerous modifications may be made to the illustrative embodiments and that other arrangements may be devised without departing from the spirit and scope of the present invention as defined by the appended claims. It should be understood that the different dependent claims and the features described herein may be combined in ways other than as described in the original claims. It is also to be understood that features described in connection with separate embodiments may be used in other described embodiments.

Claims

1. A power grid association relation discovery method based on clustering and information entropy, characterized by comprising the following steps:

2. The power grid association relation discovery method based on clustering and information entropy according to claim 1, wherein in S1 eight groups of power grid operation mode features are obtained, the categories being load level, direct current power, generator startup, generator power, spinning reserve, line switching, bus voltage, and secondary branch power near the section.

3. The power grid association relation discovery method based on clustering and information entropy according to claim 1, wherein in S2 the K-S test is used to perform a normal distribution test on each group of primary screening features to determine whether they are normally distributed.

4. The power grid association relation discovery method based on clustering and information entropy according to claim 1, wherein S2 further comprises: for features that are not normally distributed, calculating the variance of each feature, directly removing features whose variance is 0, and, after variance screening, calculating the rank correlation coefficients between the features within the corresponding feature group.

5. The power grid association relation discovery method based on clustering and information entropy according to claim 1, wherein in S2 the sample distance $\mathrm{dist}(x_i, x_j)$ is:

$x_1, \ldots, x_n$ represent the n sample points within a feature group, i = 1, 2, …, n;

S21, determining eps and MinPts from the sample distance matrix dist(X) by combining the k-th nearest neighbor method with the average silhouette coefficient;

S24, repeating S22 and S23; if a found sample point $x_i$ satisfies $|N_{eps}(x_i)| < MinPts$, the sample point $x_i$ is treated as a noise point, and if the sample point $x_i$ is density-reachable from a certain core point, the sample point $x_i$ is assigned to the category of that core point.

6. The power grid association relation discovery method based on clustering and information entropy according to claim 1, wherein in S3 the cosine similarity d is:

wherein the two features are $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$, both n-dimensional data points, and the feature differential measure is $r = (r_1, r_2, \ldots, r_n)$.

7. The power grid association relation discovery method based on clustering and information entropy according to claim 1, wherein in S3 the secondary screening features are clustered with the K-means++ clustering algorithm to obtain the classification label of each feature:

8. The power grid association relation discovery method based on clustering and information entropy according to claim 1, wherein constructing the decision tree from the information entropy gain of the features of the K power grid operation modes for classification division comprises:

dividing the features of the K power grid operation modes by the bi-partition method, calculating the information entropy gain of each split threshold point, and determining the current optimal split feature and optimal split point according to the information entropy gain values to form an initial decision tree;

9. A computer-readable storage device storing a computer program, characterized in that the computer program, when executed, implements the power grid association relation discovery method based on clustering and information entropy according to any one of claims 1 to 8.

10. A power grid association relation discovery device based on clustering and information entropy, comprising a storage device, a processor and a computer program stored in the storage device and executable on the processor, wherein the processor executes the computer program to implement the power grid association relation discovery method based on clustering and information entropy according to any one of claims 1 to 8.