CN118152773A

CN118152773A - Wind power data dimension reduction method based on paired tag supplementary feature selection

Info

Publication number: CN118152773A
Application number: CN202410258429.4A
Authority: CN
Inventors: 张平; 王云鹤; 王光磊
Original assignee: Hebei University of Technology
Current assignee: Hebei University of Technology
Priority date: 2024-03-07
Filing date: 2024-03-07
Publication date: 2024-06-07

Abstract

The invention discloses a wind power data dimension reduction method based on paired tag supplementary feature selection, which comprises the steps of firstly obtaining wind power data features, and calculating the total amount of classification information provided by candidate features for a tag set; then, calculating the weight of the candidate feature providing classification information for the paired tags, the total amount of classification information provided by the candidate feature for all paired tags, and redundant information between the candidate feature and all selected features; finally, a feature importance evaluation criterion is provided, and an importance score of the candidate feature is calculated according to the criterion; adding the index of the candidate feature with the largest importance score into the selected feature set, and deleting the candidate feature from the wind power data feature set; repeating the steps until the number of the features in the selected feature set reaches a preset value, wherein all the features in the selected feature set are the selected features. The method utilizes the candidate features to provide the classification information weight for the paired tags to reevaluate the total classification information amount provided by each candidate feature for the paired tags, accurately quantifies the relationship between the candidate features and the tags, and improves the accuracy of feature selection.

Description

Wind power data dimension reduction method based on paired tag supplementary feature selection

Technical Field

The invention belongs to the technical field of wind power data processing, and particularly relates to a wind power data dimension reduction method based on paired tag supplementary feature selection.

Background

Feature selection plays a critical role in processing wind power data. For very large data sets, selecting appropriate features improves the efficiency and accuracy of the prediction model. In wind power data analysis, multi-label feature selection may involve factors such as wind speed, wind direction, temperature, humidity, mechanical vibration, turbine state and power grid data. By analyzing historical data, the features with a significant influence on wind power generation can be determined, which improves the accuracy of the prediction model and optimizes the operation and power generation efficiency of the wind farm. Accurate feature selection helps to establish a more reliable prediction model and provides a more reliable decision basis for the development and management of the wind power industry.

Common evaluation criteria for multi-label feature selection include distance measures, fuzzy set theory and information theory. Information-theoretic criteria have three advantages. First, they are objective and quantifiable: the contribution of a feature to the target variable is estimated mathematically, which makes the selection process more rigorous and reliable. Second, they take the correlation among features into account and avoid selecting features that carry redundant information, which improves the generalization ability of the model. Third, they reduce computational complexity while improving the prediction performance and interpretability of the model, and so support the interpretation of model results. Feature selection based on information theory therefore plays an important role in data analysis and machine learning and provides an effective path for building and optimizing models.

When measuring feature relevance, existing multi-label feature selection methods either ignore the amount of information that a candidate feature provides for paired tags or assume that this amount is the same for every pair of tags. Either choice leads to inaccurate feature evaluation, so the features cannot be selected accurately.

Disclosure of Invention

In view of the shortcomings of the prior art, the invention aims to provide a wind power data dimension reduction method based on paired tag supplementary feature selection.

The invention solves the technical problem by adopting the following technical scheme:

A wind power data dimension reduction method based on paired tag supplementary feature selection, characterized by comprising the following steps:

Step 1: acquiring a wind power data feature set, and initializing the selected feature set;

Step 2: calculating the sum of mutual information between the candidate feature and all the tags according to formula (1) to obtain the total amount of classification information provided by the candidate feature for the tag set;

$$CIS(f_k, L) = \sum_{l_i \in L} I(f_k; l_i) \tag{1}$$

Wherein CIS(f_k, L) represents the total amount of classification information provided by candidate feature f_k for tag set L, and I(f_k; l_i) represents mutual information between candidate feature f_k and tag l_i;

Step 3: calculating the weight with which the candidate feature provides classification information for the paired tags according to formula (3);

Wherein w represents the weight with which the candidate feature f_k provides classification information for the paired tags l_i and l_j, I(l_i; l_j; f_k) represents ternary mutual information between the candidate feature f_k and the paired tags l_i and l_j, and H(l_i) and H(l_j) are the information entropies of the tags l_i and l_j;

Calculating the total amount of classification information provided by the candidate features for all pairs of tags according to formula (4);

Wherein CID(f_k, L) represents the total amount of classification information provided by candidate feature f_k for all pairs of tags in tag set L, and I(l_i, l_j; f_k) represents joint mutual information between candidate feature f_k and the paired tags l_i and l_j;

Step 4: calculating redundant information between the candidate feature and all selected features according to formula (5);

Wherein RI(f_k, S) represents redundant information between candidate feature f_k and all selected features, S represents the set of selected features, and I(f_k; f_j) represents mutual information between candidate feature f_k and selected feature f_j;

Step 5: calculating an importance score for the candidate feature according to the feature importance evaluation criterion of formula (6);

Wherein J(f_k) represents the importance score of the candidate feature f_k;

adding the index of the candidate feature with the largest importance score into the selected feature set, and deleting the candidate feature from the wind power data feature set; repeating the steps until the number of the features in the selected feature set reaches a preset value, wherein all the features in the selected feature set are the selected features.

Compared with the prior art, the invention has the beneficial effects that:

Existing multi-label feature selection methods are limited when measuring feature relevance: they either ignore the amount of information that candidate features provide for paired tags or assume that this amount is the same for every pair, so the importance of candidate features is evaluated inaccurately and the selection result is unsatisfactory. The method uses the weight with which a candidate feature provides classification information for paired tags to re-evaluate the total amount of classification information that each candidate feature provides for the paired tags. Because it takes the correlation and importance among paired tags into account, it quantifies the relationship between candidate features and tags more accurately, which improves both feature selection and model performance. The method helps to identify more accurately the features that are key to multi-label prediction and provides a more reliable basis for multi-label data analysis and predictive modeling.

Drawings

Fig. 1 is an overall flow chart of the present invention.

Detailed Description

The following description of specific embodiments is given by way of illustration only and not by way of limitation of the scope of the application.

The invention provides a wind power data dimension reduction method based on paired tag supplementary feature selection (hereinafter "the method"; see Fig. 1), which comprises the following steps:

Step 1: collecting from a wind farm a wind power data feature set containing information such as wind speed, ambient temperature and nacelle yaw angle, denoted F = {f_1, f_2, …, f_n}, wherein f_1, f_2, …, f_n are the features and n is the number of features; the features in the wind power data feature set F are used as candidate features; initializing the selected feature set S, namely emptying it, and setting the number of features to be selected into S as K (K < n);

Step 2: the sum of mutual information between the candidate feature and all the tags in the tag set L is calculated as follows:

$$CIS(f_k, L) = \sum_{l_i \in L} I(f_k; l_i) \tag{1}$$

Wherein CIS(f_k, L) represents the total amount of classification information provided by candidate feature f_k for tag set L, I(f_k; l_i) represents mutual information between candidate feature f_k and tag l_i, and the mutual information is calculated as:

$$I(X; Y) = \sum_{s} \sum_{q} p(x_s, y_q) \log \frac{p(x_s, y_q)}{p(x_s)\, p(y_q)} \tag{2}$$

Wherein X and Y denote different random variables, p(x_s, y_q) denotes the joint probability distribution of X and Y, and p(x_s) and p(y_q) denote the marginal probability distributions of X and Y, respectively;
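Step 2 can be sketched in a few lines of Python. This is a minimal illustration only, assuming that every feature and tag has already been discretized into a finite set of values, that probabilities are estimated from sample frequencies, and that the logarithm is taken to base 2; the function names `mutual_information` and `cis` and the toy data are hypothetical and not part of the invention.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """I(X;Y) in bits for two equal-length sequences of discrete values.

    Probabilities are estimated from counts, an assumption of this sketch.
    """
    n = len(x)
    joint = Counter(zip(x, y))   # counts of (x_s, y_q)
    count_x = Counter(x)         # counts of x_s
    count_y = Counter(y)         # counts of y_q
    return sum(
        (c / n) * log2(c * n / (count_x[a] * count_y[b]))
        for (a, b), c in joint.items()
    )


def cis(feature, labels):
    """CIS(f_k, L): sum of I(f_k; l_i) over every tag column l_i in L."""
    return sum(mutual_information(feature, label) for label in labels)


# Toy data (hypothetical): one discretized feature and two binary tags.
f_k = [0, 0, 1, 1]
l_1 = [0, 0, 1, 1]   # fully determined by f_k, contributes 1 bit
l_2 = [0, 1, 0, 1]   # independent of f_k, contributes 0 bits
print(cis(f_k, [l_1, l_2]))  # 1.0
```

Continuous measurements such as wind speed would need to be binned before this frequency-based estimate applies; the binning scheme is left open here.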

Step 3: calculating the total amount of classification information provided by candidate features for all pairs of tags;

Step 3.1: calculating the weight with which the candidate feature f_k provides classification information for each pair of tags in the tag set L, specifically as follows:

Wherein w represents the weight with which the candidate feature f_k provides classification information for the paired tags l_i and l_j, and -1 < w < 1; I(l_i; l_j; f_k) represents ternary mutual information between the candidate feature f_k and the paired tags l_i and l_j; H(l_i) and H(l_j) are the information entropies of the tags l_i and l_j, are used for normalization, and are non-negative;

From information theory, I(l_i; l_j; f_k) = I(l_i, l_j; f_k) - I(f_k; l_i) - I(f_k; l_j), where I(l_i, l_j; f_k) is the joint mutual information between the candidate feature f_k and the paired tags l_i and l_j, which measures the amount of information the candidate feature provides for the paired tags jointly; I(f_k; l_i) is the mutual information between candidate feature f_k and tag l_i, and I(f_k; l_j) is the mutual information between candidate feature f_k and tag l_j, which measure the amount of information the candidate feature provides for each single tag. If 0 < w < 1, then I(l_i, l_j; f_k) > I(f_k; l_i) + I(f_k; l_j), i.e. the candidate feature provides more information for the paired tags jointly than for the two tags separately, so the weight of I(l_i, l_j; f_k) in the feature importance evaluation is increased; if -1 < w < 0, then I(l_i, l_j; f_k) < I(f_k; l_i) + I(f_k; l_j), i.e. the candidate feature provides less information for the paired tags jointly than for the two tags separately, so the weight of I(l_i, l_j; f_k) in the feature importance evaluation is reduced.
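The sign behaviour of the ternary mutual information can be checked with a small Python sketch. The identity I(l_i; l_j; f_k) = I(l_i, l_j; f_k) - I(f_k; l_i) - I(f_k; l_j) is taken from the text. The normalization in `pair_weight`, division by H(l_i) + H(l_j), is only an assumption made for illustration, because the weight formula itself is not reproduced here; all function names are hypothetical, and entropies are estimated from sample frequencies with base-2 logarithms.

```python
from collections import Counter
from math import log2


def entropy(x):
    """H(X) in bits, estimated from the value frequencies in x."""
    n = len(x)
    return -sum((c / n) * log2(c / n) for c in Counter(x).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def ternary_mi(l_i, l_j, f_k):
    """I(l_i; l_j; f_k) = I(l_i, l_j; f_k) - I(f_k; l_i) - I(f_k; l_j)."""
    joint = mutual_information(list(zip(l_i, l_j)), f_k)
    return joint - mutual_information(f_k, l_i) - mutual_information(f_k, l_j)


def pair_weight(l_i, l_j, f_k):
    """Weight w for one tag pair.

    ASSUMPTION: normalizes by H(l_i) + H(l_j); the source only states that
    the two entropies are used for normalization and that -1 < w < 1.
    """
    denom = entropy(l_i) + entropy(l_j)
    return ternary_mi(l_i, l_j, f_k) / denom if denom > 0 else 0.0


l_i = [0, 0, 1, 1]
l_j = [0, 1, 0, 1]
f_xor = [0, 1, 1, 0]   # informative only about the pair (l_i XOR l_j)
f_dup = [0, 0, 1, 1]   # copy of l_i

print(pair_weight(l_i, l_j, f_xor))   # positive: pair information exceeds the sum
print(pair_weight(l_i, l_i, f_dup))   # negative: pair information is below the sum
```

The XOR feature tells nothing about either tag alone but determines the pair, so w > 0; a feature that duplicates both tags of a pair is counted twice by the single-tag terms, so w < 0.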

Step 3.2: the total amount of classification information provided by candidate features for all pairs of tags is calculated as follows:

Wherein CID(f_k, L) represents the total amount of classification information provided by the candidate feature f_k for all pairs of tags in the tag set L;
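One way to read Step 3.2 is sketched below in Python. Two assumptions are stacked here and neither is confirmed by the text: the weight is normalized by H(l_i) + H(l_j), and each pair contributes (1 + w) · I(l_i, l_j; f_k), so that a positive w increases and a negative w decreases the contribution of the joint mutual information, as the description of w requires. The names `cid`, `entropy` and `mutual_information` are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(x):
    """H(X) in bits, estimated from the value frequencies in x."""
    n = len(x)
    return -sum((c / n) * log2(c / n) for c in Counter(x).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def cid(f_k, labels):
    """Total classification information of f_k over all unordered tag pairs."""
    total = 0.0
    for l_i, l_j in combinations(labels, 2):
        joint = mutual_information(list(zip(l_i, l_j)), f_k)  # I(l_i, l_j; f_k)
        ternary = (joint
                   - mutual_information(f_k, l_i)
                   - mutual_information(f_k, l_j))            # I(l_i; l_j; f_k)
        denom = entropy(l_i) + entropy(l_j)
        w = ternary / denom if denom > 0 else 0.0   # ASSUMED normalization
        total += (1.0 + w) * joint                  # ASSUMED weighting form
    return total


labels = [[0, 0, 1, 1], [0, 1, 0, 1]]
print(cid([0, 1, 1, 0], labels))  # XOR feature: joint term is weighted up
print(cid([0, 0, 1, 1], labels))  # copy of the first tag: w = 0, joint term unchanged
```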

Step 4: the redundant information between the candidate feature and all the selected features is calculated as follows:

Wherein RI(f_k, S) represents redundant information between candidate feature f_k and all selected features in the selected feature set S, and I(f_k; f_j) represents mutual information between candidate feature f_k and selected feature f_j;
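A minimal Python sketch of the redundancy term follows. It assumes that RI(f_k, S) is the plain sum of I(f_k; f_j) over the selected features; whether the original formula also divides by the size of S is not stated in the text, so this is an assumption, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(x):
    """H(X) in bits, estimated from the value frequencies in x."""
    n = len(x)
    return -sum((c / n) * log2(c / n) for c in Counter(x).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def ri(f_k, selected):
    """RI(f_k, S): ASSUMED to be the sum of I(f_k; f_j) over all f_j in S."""
    return sum(mutual_information(f_k, f_j) for f_j in selected)


f_k = [0, 0, 1, 1]
S = [
    [0, 0, 1, 1],   # identical to f_k, fully redundant (1 bit)
    [0, 1, 0, 1],   # independent of f_k (0 bits)
]
print(ri(f_k, S))    # 1.0
print(ri(f_k, []))   # 0, nothing selected yet so no redundancy
```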

Step 5: based on CIS(f_k, L), CID(f_k, L), RI(f_k, S) and the max-relevance min-redundancy strategy, a feature importance evaluation criterion is proposed, and the importance score of the candidate feature is calculated according to this criterion, specifically as follows:

Wherein J(f_k) represents the importance score of the candidate feature f_k;

Evaluating the importance of each candidate feature according to a feature importance evaluation criterion, adding the index of the candidate feature with the largest importance score into the selected feature set S, and deleting the candidate feature from the wind power data feature set F;

The above steps are repeated until the number of features in the selected feature set S reaches the preset number K; the features in S are then the finally selected features, which completes the dimension reduction of the wind power data.
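The greedy loop described above can be sketched in Python as follows. The importance criterion is passed in as a function because the exact form of J(f_k) is not reproduced here; the `toy_score` used in the demonstration (single-tag relevance minus redundancy) is a stand-in chosen for brevity and is not the criterion of the invention. All names and data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(x):
    """H(X) in bits, estimated from the value frequencies in x."""
    n = len(x)
    return -sum((c / n) * log2(c / n) for c in Counter(x).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def greedy_select(n_features, k, score):
    """Forward selection: score(candidate_index, selected_indices) returns J."""
    candidates = list(range(n_features))   # indices of the features still in F
    selected = []                          # S starts empty
    while candidates and len(selected) < k:
        best = max(candidates, key=lambda c: score(c, selected))
        selected.append(best)              # add the index with the largest score to S
        candidates.remove(best)            # delete that feature from F
    return selected


# Toy data (hypothetical): three discretized features and two binary tags.
features = [
    [0, 0, 1, 1],   # f_0 copies tag l_1
    [0, 0, 1, 1],   # f_1 duplicates f_0 (redundant once f_0 is chosen)
    [0, 1, 0, 1],   # f_2 copies tag l_2
]
labels = [[0, 0, 1, 1], [0, 1, 0, 1]]


def toy_score(c, selected):
    # Stand-in criterion, NOT the patent's J(f_k): relevance minus redundancy.
    relevance = sum(mutual_information(features[c], l) for l in labels)
    redundancy = sum(mutual_information(features[c], features[s]) for s in selected)
    return relevance - redundancy


print(greedy_select(len(features), 2, toy_score))  # [0, 2]
```

In the demonstration the duplicate feature f_1 is skipped in the second round because its redundancy with f_0 cancels its relevance, which is the behaviour the redundancy term is meant to produce.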

Matters not described in the invention are applicable to the prior art.

Claims

1. The wind power data dimension reduction method based on paired tag supplementary feature selection is characterized by comprising the following steps of:

Wherein CID(f_k, L) represents the total amount of classification information provided by candidate feature f_k for all pairs of tags in tag set L, and I(l_i, l_j; f_k) represents joint mutual information between candidate feature f_k and the paired tags l_i and l_j;

Wherein J(f_k) represents the importance score of the candidate feature f_k;