CN112085619A

CN112085619A - Feature selection method for power distribution network data optimization

Info

Publication number: CN112085619A
Application number: CN202010797427.4A
Authority: CN
Inventors: 李帆; 周蓝波; 余捷; 侯仲华; 贝翔飚; 顾珏; 宗卫国; 徐姗姗; 夏子朋
Original assignee: State Grid Shanghai Electric Power Co Ltd
Current assignee: State Grid Shanghai Electric Power Co Ltd
Priority date: 2020-08-10
Filing date: 2020-08-10
Publication date: 2020-12-15

Abstract

A feature selection method for power distribution network data optimization, belonging to the field of electric power data analysis and processing. Based on the relevant data sources, the factors influencing power distribution network faults are quantified so that they become fault feature variables; the data matrix is preprocessed and discretized according to the mean value of each fault feature variable; the number n of output features is supplied externally by the user, and the category data matrix is then input; target value A is calculated from the relevance of each fault feature by maximizing mutual information; the remaining output features are then updated iteratively; the target value B model function is set to the ratio of the feature relevance index to the redundancy index and is maximized; the highest-scoring feature vectors are selected in turn until the size of the selected feature set reaches a predetermined limit, at which point the optimal feature subset is output; otherwise the steps are repeated. The technical scheme can effectively reduce the complexity of feature selection and thereby improve the accuracy of power distribution network fault data classification.

Description

Feature selection method for power distribution network data optimization

Technical Field

The invention relates to a feature selection method for optimizing power distribution network fault data, in particular to a method for selecting and processing big data using multi-label features with maximum relevance and minimum redundancy, and belongs to the field of electric power data analysis and processing.

Background

In recent years, data mining technology has been developed and widely used in various industries. The data mining technology can intelligently process large-scale data, discover the implicit rules of historical data and predict unknown events. Therefore, the data mining technology can be used for mining the complex relation between the fault and the influence factors of the fault, and further building a model to predict the fault of the power distribution network.

At present, the data mining technology is also popularized in the field of power distribution network automation transformation, and in many practical applications, a data set stored in a power distribution network database often has thousands or even tens of thousands of features, but not all the features are helpful for finding important information hidden behind the data.

In a classification problem, the information that determines the class of a sample is contained in the sample's feature vector. The completeness of the sample information, the relevance of the features to the class, and the redundancy among the features directly determine the classification capability of the learning algorithm. A large number of irrelevant and redundant features not only reduces the classification capability of the learning algorithm but also adds unnecessary work.

Feature selection means selecting the features most relevant to classification and removing redundant and invalid features. As the first step of data processing, feature selection reduces the data scale for large data sets, lowers the difficulty of learning the target model, reduces the dimensionality of high-dimensional data to overcome the curse of dimensionality, and helps prevent overfitting. In particular, when learning from high-dimensional data, the difficulty and cost of analyzing the data grow exponentially with its dimension: a complex model must be learned to obtain sufficient expressive power, and an exponentially growing volume of data is required to support learning that complex model. If the amount of data is too small, the model overfits and generalizes poorly.

Feature selection is one of the most effective means of reducing data dimensionality and improving the generalization capability of learning algorithms, and it is an indispensable part of data preprocessing in pattern recognition. Eliminating features irrelevant to the categories addresses the sensitivity of most learning algorithms to irrelevant and redundant features, so that the algorithms focus on useful features and the capability of mining useful information from the data is improved.

How to reduce the dimensionality of large-scale distribution network data in order to obtain effective, simplified data is becoming increasingly urgent. Feature selection serves as a key data analysis method and preprocessing means: before knowledge mining is carried out on the data, an optimal feature subset is selected from the original feature set. This eliminates interference from data noise, removes redundant and irrelevant features, greatly reduces the complexity of subsequent data processing, shortens the running time, and improves the accuracy and effectiveness of data analysis.

However, it is extremely difficult to find the optimal feature set, as a representation of the data, in the huge subset space of the original feature set. Feature extraction refers to generating a small set of new features by merging or transforming the original features, whereas feature selection reduces the spatial dimension by selecting the most significant features. Feature selection methods can be divided into four categories: filter, wrapper, embedded and hybrid approaches. The filter method performs a statistical analysis of the feature space to select a discriminative subset of features. A feature selection method should be able to identify and remove as many irrelevant and redundant features as possible. Most feature selection methods can effectively remove irrelevant features but cannot handle redundant features.

Because an excessive number of model input variables lowers the average correct prediction rate of a prediction model, and because the input may contain redundant feature variables and weakly relevant variables, establishing a method that selects and processes big data using multi-label features with maximum relevance and minimum redundancy is a technical problem that urgently needs to be solved in practical work.

Disclosure of Invention

The invention aims to provide a feature selection method for power distribution network data optimization. The method adopts an improved maximum-relevance minimum-redundancy feature selection algorithm, removes irrelevant features by performing relevance analysis on the original feature set, retains strongly relevant features, and measures the classification error rate of the selected features with a classifier.

The technical scheme of the invention is as follows: a feature selection method for power distribution network data optimization, characterized by comprising the following steps:

according to the relevant data sources, quantifying the factors influencing power distribution network faults so that they become fault feature variables;

preprocessing the data matrix and discretizing it according to the mean value of each fault feature variable;

receiving the number n of output features, which is supplied externally by the user, and then inputting the category data matrix;

calculating target value A from the relevance of each fault feature by maximizing mutual information, and then iteratively updating the remaining output features;

setting the target value B model function to the ratio of the feature relevance index to the redundancy index, and maximizing it;

and selecting the highest-scoring feature vectors in turn until the size of the selected feature set reaches a predetermined limit, at which point the optimal feature subset is output; otherwise, repeating the above steps.
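The discretization step above can be sketched as follows. This is a minimal illustration only: the text says the data matrix is discretized according to the mean value, and the two-level thresholding used here (1 above the column mean, 0 otherwise) is an assumption, as are the function name `discretize_by_mean` and the sample values.

```python
from statistics import mean


def discretize_by_mean(matrix):
    """Binarize each column of a row-major data matrix against its own mean.

    Assumption: values strictly above the column mean map to 1, all others to 0.
    """
    if not matrix:
        return []
    n_cols = len(matrix[0])
    col_means = [mean(row[j] for row in matrix) for j in range(n_cols)]
    return [[1 if row[j] > col_means[j] else 0 for j in range(n_cols)]
            for row in matrix]


# Hypothetical sample: rows are feeders, columns are two quantified factors.
samples = [[10.0, 3.0], [20.0, 1.0], [30.0, 2.0]]
binary = discretize_by_mean(samples)
print(binary)
```

Discretizing first keeps the later mutual information estimates simple, because they can then be computed from plain frequency counts.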

In the feature selection method, relevance analysis is performed on the original feature set, irrelevant features are removed, strongly relevant features are retained, and a classifier is used to measure the classification error rate of the selected features, so that an optimal feature subset with low redundancy among features and high relevance between the features and the predicted variable can be selected. The weighted correlation coefficient calculation method that is introduced can measure the degree of correlation among variables of all types.

Further, the category data matrix is C = {1, 2, 3, 4, 5, ..., C}; target value A is calculated from the relevance of each fault feature by maximizing mutual information, and the fault feature with the highest relevance score is selected according to target value A and added to the final solution set;

the correlation algorithm of the fault is as follows:

In the formula, D is the mutual information value between the features and the categories, c is a category of the data set, and |S| is the number of features in the feature set.
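The relevance formula itself is not reproduced in this text. As a hedged reconstruction, the standard maximum-relevance criterion of mRMR-type methods, which matches the symbol definitions given here, would read as below; the mutual information symbol I(·;·) and the index x_i are assumptions not defined in the surviving text.

```latex
\max D(S, c), \qquad D = \frac{1}{|S|} \sum_{x_i \in S} I(x_i; c)
```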

Further, after the relevance of each fault feature has been calculated by maximizing mutual information, the remaining output features are processed iteratively, and the redundancy value between the output features and the remaining features is calculated as the average minimum redundancy value;

the feature selection method requires the mutual correlation among the feature attributes to be minimal, namely the minimum redundancy principle, which is expressed by minimizing the mutual information between features as follows:

wherein R is the mutual information value between the features;

if the output feature subset contains a plurality of features, the average of the redundancy values over the output feature subset is taken as the redundancy score, calculated as follows:

where P is the set of output features, x_l is an output feature vector, and x_i is the i-th feature vector.
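The two redundancy formulas are likewise missing from this text. A hedged reconstruction, based on the standard minimum-redundancy criterion and on the symbols defined here, is given below. It assumes that I(·;·) denotes mutual information, that S is the feature set used in the relevance formula, and that the averaged score is taken between a remaining feature x_i and every already selected feature x_l in P.

```latex
\min R(S), \qquad R = \frac{1}{|S|^{2}} \sum_{x_i, x_j \in S} I(x_i; x_j)

R(x_i, P) = \frac{1}{|P|} \sum_{x_l \in P} I(x_i; x_l)
```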

Furthermore, the target value B model function is set to the ratio of the feature relevance index to the redundancy index and is maximized;

after the two target values of each feature have been calculated, the non-dominant features are determined;

a reference feature is called a non-dominant feature if either of the following conditions is met:

(1) the target value A of the reference feature is greater than or equal to the target value A of all other features, and the target value B of the reference feature is greater than or equal to the target value B of all other features;

(2) the target value A of the reference feature is greater than the target value A of all other features while the target value B of the reference feature is less than the target value B of all other features, or vice versa.
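One way to make these conditions concrete is to read them as the usual Pareto notion of non-dominance: a feature is non-dominant when no other feature is at least as good on both target values and strictly better on one. That reading is an assumption about the intent of the text, and the function name and scores below are hypothetical.

```python
def non_dominated(scores):
    """Return names of features that are not Pareto-dominated on (A, B).

    scores maps a feature name to its (target_a, target_b) pair; both
    targets are treated as quantities to maximize.
    """
    def dominates(p, q):
        # p dominates q if it is no worse on both targets and better on one.
        return p[0] >= q[0] and p[1] >= q[1] and (p[0] > q[0] or p[1] > q[1])

    return [name for name, s in scores.items()
            if not any(dominates(t, s)
                       for other, t in scores.items() if other != name)]


# Hypothetical target values for three candidate features.
scores = {"f1": (0.9, 1.0), "f2": (0.5, 3.0), "f3": (0.4, 0.9)}
front = non_dominated(scores)
# Among the non-dominant features, take the one with the largest target B.
chosen = max(front, key=lambda name: scores[name][1])
print(front, chosen)
```

Here f3 is dominated by f1, while f1 and f2 trade off against each other and both remain candidates.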

Further, the feature selection method adds the feature having the largest target value B among the non-dominant features to the output feature set. The remaining output features are searched for incrementally, one at a time. The highest-scoring feature vectors are selected in turn until the size of the selected feature set reaches the predetermined limit, at which point the optimal feature subset is output; otherwise, the above steps are repeated.

The feature selection method performs relevance analysis on the original feature set, removes irrelevant features, retains strongly relevant features, and measures the classification error rate of the selected features with a classifier, so that an optimal feature subset with low redundancy among features and high relevance between the features and the predicted variable can be selected; the complexity of feature selection is effectively reduced, and the accuracy of power distribution network fault data classification is thereby improved.

Compared with the prior art, the invention has the advantages that:

1. The technical scheme performs relevance analysis on the original feature set, removes irrelevant features, retains strongly relevant features, and measures the classification error rate of the selected features with a classifier; through the feature subset model function, an optimal feature subset with low redundancy among features and high relevance between the features and the predicted variable can be selected;

2. Through feature selection and optimization, the technical scheme finds the most effective features among many features and eliminates redundant and duplicate features, which effectively reduces the complexity of feature selection and thereby improves the accuracy of power distribution network fault data classification;

3. The technical scheme adopts an improved maximum-relevance minimum-redundancy feature selection algorithm and introduces a weighted correlation coefficient calculation method, so that the degree of correlation among variables of all types can be measured; the complexity of feature selection is effectively reduced, and the accuracy of power distribution network fault data classification is thereby improved.

Drawings

FIG. 1 is a schematic block flow diagram of the process of the present invention;

FIG. 2 is a comparison of failure prediction accuracy for the present invention versus an unoptimized method feature set.

Detailed Description

The invention is further illustrated with reference to the following figures and examples.

In fig. 1, the technical scheme of the invention comprises the following steps:

Thus, feature selection and optimization finds the most effective features among many features and rejects redundant features, duplicate features, and the like. This effectively reduces the complexity of feature selection and thereby improves the accuracy of power distribution network fault data classification.

Specifically, the category data matrix in the technical scheme of the invention is C = {1, 2, 3, 4, 5, ..., C}. Target value A is calculated from the relevance of each fault feature by maximizing mutual information, and the fault feature with the highest relevance score is selected according to target value A and added to the final solution set.

The correlation algorithm of the fault is as follows:
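The formula is not reproduced in this text. As a rough sketch of the idea, the relevance of each discretized feature to the class label can be estimated as empirical mutual information from frequency counts. The function names, the use of base-2 logarithms and the sample data are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Empirical mutual information, in bits, of two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())


def relevance_scores(features, labels):
    """Target value A per feature: mutual information with the class label."""
    return [mutual_information(col, labels) for col in features]


# Hypothetical discretized data: each inner list is one feature column.
features = [[0, 0, 1, 1], [0, 1, 0, 1]]
labels = [0, 0, 1, 1]
scores = relevance_scores(features, labels)
# The most relevant feature seeds the final solution set.
best = max(range(len(scores)), key=scores.__getitem__)
print(scores, best)
```

The first column determines the label completely and scores 1 bit, while the second is independent of the label and scores 0.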

After the relevance of each power distribution network fault feature has been calculated by maximizing mutual information, the remaining output features are processed iteratively, and the redundancy value between the output features and the remaining features is calculated as the average minimum redundancy value.

The mutual correlation among the feature attributes is required to be minimal, namely the minimum redundancy principle, which is expressed by minimizing the mutual information between features as follows:

wherein R is the mutual information value between the features.
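A minimal sketch of the averaged redundancy score follows, assuming it is the mean mutual information between a candidate feature and each feature already in the output set. The helper names and the sample columns are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Empirical mutual information, in bits, of two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())


def mean_redundancy(candidate, selected):
    """Average mutual information between a candidate and the selected features."""
    if not selected:
        return 0.0  # nothing selected yet, so nothing to be redundant with
    return sum(mutual_information(candidate, s) for s in selected) / len(selected)


# Hypothetical columns: one already selected, one duplicate, one independent.
selected = [[0, 0, 1, 1]]
duplicate = [0, 0, 1, 1]
independent = [0, 1, 0, 1]
r_dup = mean_redundancy(duplicate, selected)
r_ind = mean_redundancy(independent, selected)
print(r_dup, r_ind)
```

A duplicate of a selected feature gets the maximum redundancy score, so it is penalized in the next selection round.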

The target value B model function is set to the ratio of the feature relevance index to the redundancy index and is maximized. After the two target values of each feature have been calculated, the non-dominant features are determined. A reference feature is called a non-dominant feature if the following condition is met:

(1) the target value A of the reference feature is greater than or equal to the target value A of all other features, and the target value B of the reference feature is greater than or equal to the target value B of all other features.

From the non-dominant features, the feature having the largest target value B is added to the output feature set. The remaining output features are searched for incrementally, one at a time. The highest-scoring feature vectors are selected in turn until the size of the selected feature set reaches the predetermined limit, at which point the optimal feature subset is output; otherwise, the above steps are repeated.
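The ratio form of target value B can be illustrated with a few lines of code. The small constant `eps` guarding against division by zero is an assumption, since the text does not say how a zero redundancy is handled, and the candidate values are hypothetical.

```python
def target_b(relevance, redundancy, eps=1e-12):
    """Ratio of relevance to redundancy; eps is an assumed guard against zero."""
    return relevance / (redundancy + eps)


# Hypothetical (relevance, redundancy) pairs for three candidate features.
candidates = {"f1": (0.8, 0.4), "f2": (0.6, 0.1), "f3": (0.8, 0.8)}
ratios = {name: target_b(a, r) for name, (a, r) in candidates.items()}
best = max(ratios, key=ratios.get)
print(ratios, best)
```

The ratio favours f2, which is slightly less relevant than f1 but far less redundant with the features already chosen.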

The technical scheme of the invention thus aims to find a feature set that maximizes the mutual information between the features and the multiple labels and minimizes the mutual information among the features.

By implementing the technical scheme of the invention, the most relevant feature set with the least redundancy can be found, with the number of output features supplied by the user. The final feature set starts empty; the most relevant feature is added first, and further features are then added incrementally, one per iteration. The loop terminates when the size of the final feature set equals the user-specified feature limit.

Compared with the prior art, the technical scheme of the invention overcomes the drawback that most feature selection methods can effectively remove irrelevant features but cannot handle redundant ones, while retaining the advantages of effectively reducing the complexity of feature selection and not degrading the generalization performance of the model, thereby improving the accuracy of power distribution network fault data classification.

In the technical scheme of the invention, the pseudo code of the maximum correlation minimum redundancy feature selection algorithm for non-dominant feature selection is as follows.
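The pseudo code itself is not reproduced in this text. The sketch below is one possible reading of the steps described above, not the patent's own listing: relevance and redundancy are estimated with empirical mutual information, non-dominance is read as Pareto non-dominance, and the names `select_features` and `eps` as well as the sample data are assumptions.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Empirical mutual information, in bits, of two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())


def dominates(p, q):
    """True if score pair p is no worse than q on both targets and differs."""
    return p[0] >= q[0] and p[1] >= q[1] and p != q


def select_features(features, labels, n, eps=1e-12):
    """Greedy selection of n column indices from discretized feature columns."""
    relevance = [mutual_information(col, labels) for col in features]
    # Seed the solution with the single most relevant feature (target A).
    selected = [max(range(len(features)), key=relevance.__getitem__)]
    while len(selected) < min(n, len(features)):
        scores = {}
        for j in range(len(features)):
            if j in selected:
                continue
            # Average redundancy of candidate j with the selected features.
            redundancy = sum(mutual_information(features[j], features[i])
                             for i in selected) / len(selected)
            # Pair of (target A, target B); B is relevance over redundancy.
            scores[j] = (relevance[j], relevance[j] / (redundancy + eps))
        front = [j for j, s in scores.items()
                 if not any(dominates(t, s)
                            for k, t in scores.items() if k != j)]
        # Add the non-dominant feature with the largest target B.
        selected.append(max(front, key=lambda j: scores[j][1]))
    return selected


# Hypothetical data: column 1 duplicates column 0, column 2 is complementary.
features = [[0, 0, 1, 1, 0, 0, 1, 1],
            [0, 0, 1, 1, 0, 0, 1, 1],
            [0, 1, 0, 1, 0, 1, 0, 1]]
labels = [0, 1, 2, 3, 0, 1, 2, 3]
picked = select_features(features, labels, 2)
print(picked)
```

In this toy case all three columns are equally relevant to the label, but the second pick skips the duplicate column because its redundancy with the first pick is maximal.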

Based on an analysis of the factors influencing power distribution network faults, distribution network data for 162 feeders were surveyed and the data required for feeder fault prediction were extracted. Seventeen features were selected: distribution transformer capacity, monthly average feeder load, monthly maximum feeder load, fault time, N-N month fault number, monthly average air temperature, monthly average high (low) temperature, monthly thunderstorm day grading, monthly windstorm day grading, average fuse operation time, average segmented cable operation time, average load switch operation time, segmented insulated wire length, average branch line operation time, average transformer operation time, cable length, and number of feeder branch lines. These features were used for fault-related feature classification and for ranking the feature set by quality, and the resulting fault prediction accuracy was then compared with that obtained under the unoptimized method, as shown in fig. 2.

As can be seen from fig. 2, for both the feature ordering of the present technical scheme and that of the unoptimized method, the fault prediction accuracy of the power distribution network improves as the number of feeder fault features is gradually increased. With the present technical scheme, the accuracy peaks when the number of fault features is 12, then decreases slightly as more features are added, and finally settles near a constant value where it coincides with the prediction curve of the unoptimized method. This indicates that the full feature set contains 5 redundant features, some of which even harm fault diagnosis and reduce the prediction accuracy; the results in the figure illustrate the necessity of feature optimization. In addition, feeding only the preferred features into the model reduces the amount of data, shortens the training time and running time required by the prediction model, and improves the efficiency of fault prediction.
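Reading the feature count off the accuracy peak can be expressed as a small helper. The accuracy values below are hypothetical placeholders chosen only so that the peak falls at 12 features; they are not the values behind fig. 2, and the function name is an assumption.

```python
def peak_feature_count(accuracy_by_count):
    """Return the smallest feature count whose accuracy equals the maximum."""
    best = max(accuracy_by_count.values())
    return min(k for k, acc in accuracy_by_count.items() if acc == best)


# Hypothetical accuracies per number of features, NOT taken from fig. 2.
curve = {10: 0.80, 11: 0.83, 12: 0.86, 13: 0.85, 14: 0.84}
n_features = peak_feature_count(curve)
print(n_features)
```

Choosing the smallest count at the peak keeps the model input as compact as possible when several counts tie.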

Most current feature selection methods can effectively remove irrelevant features but do not handle redundant features well. For example, the popular ReliefF feature selection strategy randomly samples instances and sets feature weights according to the feature relevance of their nearest neighbors. ReliefF is among the most successful feature selection strategies, yet it can identify only 3 redundant features among the 17 fault feature classification vectors, which is clearly fewer than the technical scheme of the invention.

In summary, compared with the prior art, the method can effectively reduce the complexity of feature selection; for example, the number of features can be set directly to 12 according to the peak in fig. 2, thereby improving the classification accuracy of power distribution network fault data.

The technical scheme of the invention aims to find a feature set that maximizes the mutual information between the features and the multiple labels and minimizes the mutual information among the features.

Therefore, the technical scheme of the invention can find the most relevant feature set with the least redundancy, with the number of output features supplied by the user. The final feature set starts empty; the most relevant feature is added first, and further features are then added incrementally, one per iteration. The loop terminates when the size of the final feature set equals the user-specified feature limit.

The invention can be widely applied to the field of electric power data analysis and processing.

Claims

1. A characteristic selection method for power distribution network data optimization is characterized by comprising the following steps:

2. The feature selection method for power distribution network data optimization according to claim 1, wherein relevance analysis is performed on the original feature set to remove irrelevant features and retain strongly relevant features, and a classifier is used to measure the classification error rate of the selected features, so that an optimal feature subset with low redundancy among features and high relevance between the features and the predicted variable can be selected; and the weighted correlation coefficient calculation method that is introduced can measure the degree of correlation among variables of all types.

3. The feature selection method for power distribution network data optimization according to claim 1, wherein the category data matrix is C = {1, 2, 3, 4, 5, ..., C}; target value A is calculated from the relevance of each fault feature by maximizing mutual information, and the fault feature with the highest relevance score is selected and added to the final solution set;

the correlation algorithm of the fault is as follows:

4. The feature selection method for power distribution network data optimization according to claim 1, wherein after the relevance of each fault feature has been calculated by maximizing mutual information, the remaining output features are processed iteratively, and the redundancy value between the output features and the remaining features is calculated as the average minimum redundancy value;

wherein R is the mutual information value between the features;

where P is the set of output features, x_l is an output feature vector, and x_i is the i-th feature vector.

5. The feature selection method for power distribution network data optimization according to claim 1, wherein the target value B model function is set to the ratio of the feature relevance index to the redundancy index and is maximized;

6. The feature selection method for power distribution network data optimization according to claim 1, wherein the feature having the largest target value B among the non-dominant features is added to the output feature set; the remaining output features are searched for incrementally, one at a time; and the highest-scoring feature vectors are selected in turn until the size of the selected feature set reaches the predetermined limit, at which point the optimal feature subset is output; otherwise, the above steps are repeated.

7. The feature selection method for power distribution network data optimization according to claim 1, wherein relevance analysis is performed on the original feature set to remove irrelevant features and retain strongly relevant features, and a classifier is used to measure the classification error rate of the selected features, so as to select an optimal feature subset with low redundancy among features and high relevance between the features and the predicted variable; the complexity of feature selection is effectively reduced, and the accuracy of power distribution network fault data classification is thereby improved.