CN109978039B - Fan blade icing prediction method based on unbalanced data set - Google Patents


Info

Publication number
CN109978039B
CN109978039B (application CN201910207037.4A)
Authority
CN
China
Prior art keywords
sample
samples
new
minority
fan
Prior art date
Legal status
Active
Application number
CN201910207037.4A
Other languages
Chinese (zh)
Other versions
CN109978039A (en
Inventor
岳东
葛阳鸣
卜阳
宋星星
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN201910207037.4A
Publication of CN109978039A
Application granted
Publication of CN109978039B
Legal status: Active

Classifications

    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06F ELECTRIC DIGITAL DATA PROCESSING › G06F18/00 Pattern recognition › G06F18/20 Analysing
        • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
        • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
        • G06F18/2148 characterised by the process organisation or structure, e.g. boosting cascade
        • G06F18/23 Clustering techniques
        • G06F18/231 Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
        • G06F18/24 Classification techniques
        • G06F18/241 relating to the classification model, e.g. parametric or non-parametric approaches
        • G06F18/2413 based on distances to training or reference patterns
        • G06F18/24147 Distances to closest patterns, e.g. nearest neighbour classification
        • G06F18/243 relating to the number of classes
        • G06F18/24323 Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Structures Of Non-Positive Displacement Pumps (AREA)

Abstract

The invention discloses a method for predicting icing of a fan blade under unbalanced data set conditions, which balances the distribution of data samples in an unbalanced data set and predicts fan blade icing events in combination with a random forest algorithm (RF). The algorithm first applies BIRCH hierarchical clustering to the original minority-class samples, and divides each cluster into different concentration areas according to the density of the sample points. Lower-concentration areas synthesize more samples, whereas higher-concentration areas require fewer synthesized samples. To follow the original distribution of the minority-class samples, new samples are synthesized separately in the different concentration zones within each cluster. Second, the BIRCH-SMOTE algorithm improves the linear interpolation operation by increasing the randomness of the interpolation process, which effectively avoids overlapping and redundant synthesized samples. Finally, the random forest model is trained on the balanced data set to obtain the fan blade icing prediction result.

Description

Fan blade icing prediction method based on unbalanced data set
Technical Field
The invention relates to the technical field of short-term wind power generation, in particular to a method for predicting blade icing of a wind driven generator based on an unbalanced data set.
Background
Wind energy is a typical renewable clean energy source and has received much attention worldwide due to its abundance and suitability for large-scale exploitation. Wind power holds an overwhelming advantage in the world's installed renewable-energy generation capacity. Among renewable sources, wind energy accounts for more than half of the available renewable energy, and wind power generation is the most mature of all renewable-resource technologies. In recent years, world wind power generation has grown rapidly, and its prospects are bright. By December 2012, the world's installed wind power capacity had increased from 60 GW in 2000 to 282.578 GW. At present, China has the largest installed wind power capacity and the fastest growth rate in the world. By 2015, China's total installed wind power capacity reached 145.1 GW, accounting for about 2.5% of the country's total installed capacity, with an annual growth rate of 26.6%.
While developing rapidly, wind energy also faces outstanding problems. As wind turbines grow taller, their blades freeze easily in severely cold environments. Blade icing is a global problem in the field of wind power generation. Blade icing, together with the changes in material and structural properties and in loading caused by low-temperature environments, poses a great threat to the generating performance and safe operation of the fan. As the design power of fans continues to rise, fan heights also keep increasing, so that in winter a large number of fans reach low cloud layers and freeze very easily in cold, humid environments. At present, real-time fan operating data are mainly stored by a SCADA system, and the monitoring approach for blade icing faults is mainly to compare the deviation between the fan's actual power and theoretical power, triggering an alarm and shutdown when the deviation reaches a certain value. However, by the time the alarm is triggered, the blade is often already iced over a large area, and continued operation increases the risk of blade breakage and damage. Although many new fans are designed with automatic de-icing systems, the practical challenge is that the early stage of icing is hard to predict accurately enough to turn on the de-icing system as early as possible. Therefore, the prediction accuracy of the icing process determines whether the fan can operate normally and safely.
Disclosure of Invention
The purpose of the invention is as follows: the invention provides a wind turbine blade icing prediction method based on an unbalanced data set, aiming to solve the problem of insufficient accuracy in predicting fan blade icing under unbalanced data conditions.
The technical scheme is as follows: the invention provides an improved SMOTE algorithm (BIRCH-SMOTE), which balances the distribution of data samples in an unbalanced data set and predicts fan blade icing events in combination with a random forest algorithm (RF). The algorithm first applies BIRCH hierarchical clustering to the original minority-class samples, and divides each cluster into different concentration areas according to the density of the sample points. Lower-concentration areas synthesize more samples, whereas higher-concentration areas require fewer synthesized samples. To follow the original distribution of the minority-class samples, new samples are synthesized separately in the different concentration zones within each cluster. Second, the BIRCH-SMOTE algorithm improves the linear interpolation operation by increasing the randomness of the interpolation process, which effectively avoids overlapping and redundant synthesized samples. Finally, the random forest model is trained on the balanced data set to obtain the fan blade icing prediction result.
Specifically, the method for predicting the icing of the fan under the unbalanced data set comprises the following steps:
step 1) collect and sort the historical meteorological data and fan operating state data of a wind farm, and store the sorted data in a database for convenient use in prediction; the historical meteorological data of the wind farm, the fan operating state data and the prediction target together form a fan historical-data training vector; the training vector specifically comprises the following dimensions: wind speed, generator rotational speed, wind power, ambient temperature, generator internal temperature, and whether the fan blade shows icing, and can be expressed as:
X = [v_w, v_g, p, t_e, t_i, f]
where v_w represents the wind speed; v_g represents the generator rotational speed; p represents the wind power; t_e represents the ambient temperature; t_i represents the internal temperature of the generator; and f indicates whether the fan blade is iced;
step 2) apply "range method" (min-max) normalization to the fan historical-data samples composed of the collected and sorted training vectors, so that the processed data are better suited to training the learning model; the normalization formula is:
X_new = (X − X_min) / (X_max − X_min)
where X represents a fan historical-data sample; X_min and X_max represent the minimum and maximum values in the fan historical-data samples; and X_new is the processed fan historical-data sample;
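The range-method normalization of step 2) can be sketched as follows. This is an illustrative Python fragment, not part of the patent; the toy five-dimensional rows merely stand in for the fan training vectors.

```python
import numpy as np

def range_normalize(X):
    """Min-max ("range method") normalization: scale each feature
    column of X into [0, 1] using that column's min and max."""
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

# Toy sample: rows are training vectors [v_w, v_g, p, t_e, t_i]
X = np.array([[4.0, 10.0, 200.0, -5.0, 15.0],
              [8.0, 14.0, 600.0,  0.0, 25.0],
              [6.0, 12.0, 400.0, -2.5, 20.0]])
X_new = range_normalize(X)
```

After normalization every column spans exactly [0, 1], so features with very different physical ranges (wind speed vs. power) contribute comparably during model training.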
step 3) determine the number of BIRCH clusters according to the data distribution of the minority-class samples in the fan historical-data samples and the Calinski-Harabasz Index score under different candidate cluster numbers; the minority-class samples are the data samples corresponding to fan icing in the fan historical-data samples; first a set of candidate BIRCH cluster numbers is assumed, and after the Calinski-Harabasz Index scores are computed, the cluster number z with the highest score is selected; the Calinski-Harabasz Index for k clusters is calculated as:
s(k) = [tr(B_k) / tr(W_k)] × [(m − k) / (k − 1)]
where B_k is the between-cluster covariance matrix; W_k is the within-cluster covariance matrix; tr(·) denotes the trace of a matrix; m is the number of minority-class samples in the fan historical-data samples; k is the candidate number of clusters; and z is the cluster number with the highest score;
step 4) cluster the minority-class samples in the range-normalized fan historical data X_new into z clusters using the BIRCH clustering algorithm, and store the clustering results in a data set D = {cluster_1, cluster_2, cluster_3, cluster_4, cluster_5, ..., cluster_z};
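Steps 3) and 4) — choosing z by the Calinski-Harabasz score and then clustering with BIRCH — can be sketched with scikit-learn. The candidate cluster counts, the BIRCH threshold, and the three-blob toy data are illustrative assumptions, not values from the patent.

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.metrics import calinski_harabasz_score

def select_birch_clusters(X_min, candidates=(3, 4, 5, 6, 7, 8)):
    """Fit BIRCH for each candidate cluster number and return the
    count z with the highest Calinski-Harabasz score, plus labels."""
    best_z, best_score, best_labels = None, -np.inf, None
    for z in candidates:
        labels = Birch(n_clusters=z, threshold=0.1).fit_predict(X_min)
        score = calinski_harabasz_score(X_min, labels)
        if score > best_score:
            best_z, best_score, best_labels = z, score, labels
    return best_z, best_labels

# Toy minority-class data: three well-separated blobs in 2-D
rng = np.random.default_rng(0)
X_min = np.vstack([rng.normal(c, 0.1, size=(30, 2)) for c in (0.0, 5.0, 10.0)])
z, labels = select_birch_clusters(X_min)
clusters = {f"cluster_{k + 1}": X_min[labels == k] for k in range(z)}
```

On such clearly separated data the score peaks at the natural cluster count, and the resulting dictionary plays the role of the data set D = {cluster_1, ..., cluster_z}.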
step 5) calculate the density value of the minority-class sample points in each cluster of the set D according to a density formula; the sample density value of a minority-class sample X_origin is defined as the sum of its distances to the surrounding K nearest same-class neighbours, and Density can be described as:
Density(X_origin) = Σ_{i=1}^{K} d_i
where d_i represents the Euclidean distance between the two sample points, and i indexes one of the K nearest same-class neighbours around X_origin;
step 6) sort the minority-class sample points in every cluster in descending order of their density values; according to the sorting result, divide the sample points within each cluster equally into three concentration areas: high, medium and low concentration;
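Steps 5) and 6) can be sketched as below. Note that the patent defines the density value as a distance sum; this sketch assumes a smaller sum means a denser (higher-concentration) point, which is one reading of the text, and the zone names and function names are the editor's, not the patent's.

```python
import numpy as np

def density_values(X_cluster, K=5):
    """Density value of each minority sample: the sum of Euclidean
    distances to its K nearest same-class neighbours in the cluster."""
    diff = X_cluster[:, None, :] - X_cluster[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))      # pairwise distance matrix
    np.fill_diagonal(dist, np.inf)           # a point is not its own neighbour
    return np.sort(dist, axis=1)[:, :K].sum(axis=1)

def split_concentration_zones(X_cluster, K=5):
    """Sort samples by density and split them equally into high-,
    medium- and low-concentration zones (smaller distance sum = denser)."""
    order = np.argsort(density_values(X_cluster, K))
    thirds = np.array_split(order, 3)
    return {zone: X_cluster[idx]
            for zone, idx in zip(("high", "medium", "low"), thirds)}

# Toy cluster: 9 tightly packed points plus 9 scattered points
rng = np.random.default_rng(1)
X_c = np.vstack([rng.normal(0, 0.05, (9, 2)), rng.normal(3, 1.0, (9, 2))])
zones = split_concentration_zones(X_c, K=3)
```

The tightly packed points end up in the high-concentration zone, which is the zone that will later receive the fewest synthesized samples.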
step 7) in each concentration area of each cluster, find the K nearest-neighbour minority-class samples of each minority sample point; the nearest neighbours are selected by the Euclidean distance formula, i.e. the K surrounding samples with the smallest Euclidean distance to the sample point are chosen; the Euclidean distance between sample x_j and sample x_i is:
d(x_j, x_i) = sqrt( Σ_{t=1}^{n} (x_j,t − x_i,t)² )
where x_j,t and x_i,t represent the values of samples x_j and x_i in the t-th dimension, and sample x_j can be expressed as x_j = (x_j,1, x_j,2, ..., x_j,n), i.e. x_j has n dimensions in total; for the high-concentration area, select high_num minority samples from the K nearest-neighbour minority samples found for each sample point; for the medium-concentration area, select middle_num of the K nearest neighbours; for the low-concentration area, select low_num of the K nearest neighbours;
step 8) use the high_num, middle_num and low_num minority-class samples selected in step 7) to synthesize new samples according to the following formula; after synthesis, every minority sample in the high-concentration area generates high_num new sample points, and every minority sample in the medium- and low-concentration areas obtains middle_num and low_num new sample points respectively;
X_new_1,i = X_origin,i + rand(0,1) × (X_neighbor,i − X_origin,i),  i = 1, 2, ..., n
where X_new_1 is the newly generated sample point; X_origin,i is the i-th dimension of a minority-class sample selected in one of the concentration areas in step 7); X_neighbor,i is the i-th dimension of one of the neighbouring minority samples selected for X_origin in step 7); n is the total number of sample dimensions; and rand(0,1) is a random number between 0 and 1, drawn for each dimension i;
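The improved interpolation of step 8) can be sketched as follows. The per-dimension random factor reflects the "increased randomness" the text describes; the function signature, K, and the toy data are illustrative assumptions.

```python
import numpy as np

def synthesize(X_zone, num_per_sample, K=5, rng=None):
    """For each minority sample in a concentration zone, pick
    num_per_sample of its K nearest minority neighbours and
    interpolate one new sample per neighbour, drawing an
    independent rand(0,1) for every dimension."""
    if rng is None:
        rng = np.random.default_rng()
    diff = X_zone[:, None, :] - X_zone[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)
    new_samples = []
    for i, x in enumerate(X_zone):
        neighbours = np.argsort(dist[i])[:K]
        for j in rng.choice(neighbours, size=num_per_sample, replace=False):
            gap = rng.random(x.shape[0])     # one random factor per dimension
            new_samples.append(x + gap * (X_zone[j] - x))
    return np.array(new_samples)

# 10 minority samples with 5 features; e.g. high_num = 3 new points per sample
rng = np.random.default_rng(2)
X_zone = rng.normal(0.5, 0.1, (10, 5))
X_syn = synthesize(X_zone, num_per_sample=3, K=5, rng=rng)
```

Because each dimension gets its own interpolation factor, the new point lies inside the axis-aligned box spanned by the pair rather than on the straight segment between them, which is what lets the method avoid the overlapping samples that classic single-factor SMOTE can produce.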
step 9) interpolation over minority-class samples may introduce noise sample points, so the synthesized data set must be denoised; whether a newly generated sample point is noise is judged by examining the class of its neighbouring sample points; all newly generated minority sample points are scanned and the noise points are deleted; the noise identification process is as follows:
a) for a newly generated sample point X_new_1, find its 20 nearest-neighbour samples in the fan historical-data samples; let m' be the number of majority-class samples among these 20 nearest neighbours;
b) if m' = 20, judge the newly generated sample point X_new_1 to be a noise point;
c) if 0 ≤ m' ≤ 10, judge the new sample point X_new_1 to be a safe new sample point, not a noise point, and perform no operation;
d) if 10 < m' < 20, judge the new sample point X_new_1 to be a danger point; some new minority sample points need to be generated near this point, and it is added to the DANGER set;
finally, for each sample point X_new_1 in the DANGER set, generate a new sample using the SMOTE algorithm, and remove all noise points; the SMOTE formula can be expressed as:
X_new_2 = X_new_1 + rand(0,1) × (X_j − X_new_1),  j = 1, 2, ..., N
where X_new_2 is the newly synthesized minority-class sample; X_new_1 is the sample used to synthesize it; rand(0,1) is a random number between 0 and 1; X_j is a randomly selected sample among the K nearest neighbours of X_new_1 within the newly generated minority sample points; and N is the total number of newly generated minority sample points.
The newly synthesized sample points remaining after denoising are merged with X_new, and the merged result is stored in the "final synthesized sample set";
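The neighbour-counting screen of step 9) can be sketched as follows. Here m' counts majority-class neighbours — the reading under which the three cases (noise / safe / danger) are mutually consistent, as in Borderline-SMOTE; the data layout and function names are illustrative.

```python
import numpy as np

def screen_new_samples(X_new, X_all, y_all, n_neighbors=20):
    """Classify each synthesized minority point as 'noise', 'safe'
    or 'danger' by counting majority-class points (y == 0) among its
    n_neighbors nearest neighbours in the full data set."""
    labels = []
    for x in X_new:
        dist = np.linalg.norm(X_all - x, axis=1)
        nn = np.argsort(dist)[:n_neighbors]
        m = int((y_all[nn] == 0).sum())      # majority-class neighbour count
        if m == n_neighbors:
            labels.append("noise")           # delete this point
        elif m <= n_neighbors // 2:
            labels.append("safe")            # keep as-is
        else:
            labels.append("danger")          # re-SMOTE near this point
    return labels

# Minority cluster near the origin (y = 1), majority cluster near 5 (y = 0)
rng = np.random.default_rng(3)
X_all = np.vstack([rng.normal(0, 0.2, (25, 2)), rng.normal(5, 0.2, (25, 2))])
y_all = np.array([1] * 25 + [0] * 25)
labels = screen_new_samples(np.array([[0.0, 0.0], [5.0, 5.0]]), X_all, y_all)
```

A synthetic point landing inside the minority cluster is kept as safe, while one landing deep in the majority region is flagged as noise and dropped.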
step 10) train a random forest model on the data in the "final synthesized sample set" to obtain the fan icing prediction model; the operating data of fan No. 2 are used as test data to verify the effectiveness of the wind turbine blade icing prediction method based on the unbalanced data set; in the embodiment, AUC is used as the evaluation standard for the fan blade icing prediction effect; AUC is a quantitative measure of classifier quality, generally between 0.5 and 1, with a higher value indicating a better classifier; an AUC of 0.5 is equivalent to completely random classification.
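Step 10) — training the random forest and scoring with AUC — can be sketched with scikit-learn. The separable synthetic data below merely stand in for the balanced "final synthesized sample set"; all sizes and parameters are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in for the balanced "final synthesized sample set":
# two separable classes in a 5-dimensional normalized feature space
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.3, 0.1, (200, 5)),   # no-icing samples
               rng.normal(0.7, 0.1, (200, 5))])  # icing samples
y = np.array([0] * 200 + [1] * 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
```

`roc_auc_score` takes the predicted probability of the positive (icing) class, not the hard labels, so `predict_proba(...)[:, 1]` is used rather than `predict`.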
Has the advantages that: the method clusters the original minority-class data samples using the BIRCH clustering algorithm and divides each cluster into different concentration areas according to the density of the sample points. Lower-concentration areas synthesize more samples, whereas higher-concentration areas require fewer synthesized samples. To follow the original distribution of the minority-class samples, new samples are synthesized separately in the different concentration zones within each cluster. Second, the BIRCH-SMOTE algorithm improves the linear interpolation operation by increasing the randomness of the interpolation process, which effectively avoids overlapping and redundant synthesized samples. After the new minority samples are synthesized, the invention denoises the newly synthesized sample points to avoid the noise introduced by the synthesis operation. Finally, the random forest model is trained on the balanced data set to obtain the fan blade icing prediction result. Compared with the standard SMOTE algorithm and the improvements in other literature, BIRCH-SMOTE significantly improves the accuracy of fan blade icing prediction on unbalanced data sets.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a graph comparing the operation effects of the embodiment of the present invention (BIRCH-SMOTE algorithm).
Detailed Description
As shown in FIG. 1, the method for predicting the icing condition of the blade of the wind driven generator based on the unbalanced data set comprises the following steps:
step 1) collect and sort the historical meteorological data and fan operating state data of a wind farm, and store the sorted data in a database for convenient use in prediction; the historical meteorological data, the fan operating state data and the prediction target (whether the fan is iced) together form a fan historical-data training vector; the training vector specifically comprises the following dimensions: wind speed, generator rotational speed, wind power, ambient temperature, generator internal temperature, and whether the fan blade shows icing, and can be expressed as:
X = [v_w, v_g, p, t_e, t_i, f]
where v_w represents the wind speed; v_g represents the generator rotational speed; p represents the wind power; t_e represents the ambient temperature; t_i represents the internal temperature of the generator; and f indicates whether the fan blade is iced; in this embodiment, real-time operating data of 2 wind turbines in a wind farm, collected from August to December, were collated; the specific distribution of the data is as follows:
[Table of the collected data distribution, rendered as an image in the original document]
step 2) apply "range method" (min-max) normalization to the fan historical-data samples composed of the collected and sorted training vectors, so that the processed data are better suited to training the learning model; the normalization formula is:
X_new = (X − X_min) / (X_max − X_min)
where X represents a fan historical-data sample; X_min and X_max represent the minimum and maximum values in the fan historical-data samples; and X_new is the processed fan historical-data sample;
step 3) determine the number of BIRCH clusters according to the data distribution of the minority-class samples in the fan historical-data samples and the Calinski-Harabasz Index score under different candidate cluster numbers; the minority-class samples are the data samples corresponding to fan icing in the fan historical-data samples; first, the candidate BIRCH cluster numbers are assumed to be 3, 4, 5, 6, 7 and 8 clusters; then, after the Calinski-Harabasz Index scores are computed, the cluster number z with the highest score is selected; the Calinski-Harabasz Index for k clusters is calculated as:
s(k) = [tr(B_k) / tr(W_k)] × [(m − k) / (k − 1)]
where B_k is the between-cluster covariance matrix; W_k is the within-cluster covariance matrix; tr(·) denotes the trace of a matrix; m is the number of minority-class samples in the fan historical-data samples; k is the candidate number of clusters; and z is the cluster number with the highest score;
finally, after the Calinski-Harabasz Index scores are computed, 6 clusters obtain the highest score, i.e. z = 6.
Step 4) cluster the minority-class samples in the range-normalized data X_new into 6 clusters using the BIRCH clustering algorithm, and store the clustering results in a data set D = {cluster_1, cluster_2, cluster_3, cluster_4, cluster_5, cluster_6};
step 5) calculate the density value of the minority-class sample points in each cluster of the set D according to a density formula; the sample density value of a minority-class sample X_origin is defined as the sum of its distances to the surrounding K nearest same-class neighbours, and Density can be described as:
Density(X_origin) = Σ_{i=1}^{K} d_i
where d_i represents the Euclidean distance between the two sample points, and i indexes one of the K nearest same-class neighbours around X_origin; experimental exploration showed that the K value suited to the fan data in this embodiment is 15, i.e. the 15 nearest samples around each minority-class sample are selected as the basis for calculating the density value.
Step 6), arranging the minority sample points in all the clustering clusters in a descending order according to the density of the minority sample points; according to the sequencing result, dividing the sample points in the clusters into three concentration areas of high concentration, medium concentration and low concentration equally;
step 7) in each concentration area of each cluster, find the 15 nearest-neighbour minority-class samples of each minority sample point; the nearest neighbours are selected by the Euclidean distance formula, i.e. the 15 samples with the smallest Euclidean distance to the sample point are chosen; the Euclidean distance between sample x_j and sample x_i is:
d(x_j, x_i) = sqrt( Σ_{t=1}^{n} (x_j,t − x_i,t)² )
where x_j,t and x_i,t represent the values of samples x_j and x_i in the t-th dimension, and sample x_j can be expressed as x_j = (x_j,1, x_j,2, ..., x_j,n), i.e. x_j has n dimensions in total; for the high-concentration area, select high_num = 7 minority samples from the 15 nearest-neighbour minority samples found for each sample point; for the medium-concentration area, select middle_num = 11 of the 15 nearest neighbours; for the low-concentration area, select low_num = 13 of the 15 nearest neighbours;
step 8) use the 7, 11 and 13 minority-class samples selected in step 7) to synthesize new samples according to the following formula; after synthesis, every minority sample in the high-concentration area generates high_num = 7 new sample points, and every minority sample in the medium- and low-concentration areas obtains middle_num = 11 and low_num = 13 new sample points respectively;
X_new_1,i = X_origin,i + rand(0,1) × (X_neighbor,i − X_origin,i),  i = 1, 2, ..., n
where X_new_1 is the newly generated sample point; X_origin,i is the i-th dimension of a minority-class sample selected in one of the concentration areas in step 7); X_neighbor,i is the i-th dimension of one of the neighbouring minority samples selected for X_origin in step 7); n is the total number of sample dimensions; and rand(0,1) is a random number between 0 and 1, drawn for each dimension i;
step 9) interpolation over minority-class samples may introduce noise sample points, so the synthesized data set must be denoised; whether a newly generated sample point is noise is judged by examining the class of its neighbouring sample points; all newly generated minority sample points are scanned and the noise points are deleted; the noise identification process is as follows:
a) for a newly generated sample point X_new_1, find its 20 nearest-neighbour samples in the fan historical-data samples; let m' be the number of majority-class samples among these 20 nearest neighbours;
b) if m' = 20, judge the newly generated sample point X_new_1 to be a noise point;
c) if 0 ≤ m' ≤ 10, judge the new sample point X_new_1 to be a safe new sample point, not a noise point, and perform no operation;
d) if 10 < m' < 20, judge the new sample point X_new_1 to be a danger point; some new minority sample points need to be generated near this point, and it is added to the DANGER set;
finally, for each sample point X_new_1 in the DANGER set, generate a new sample using the SMOTE algorithm, and remove all noise points; the SMOTE formula can be expressed as:
X_new_2 = X_new_1 + rand(0,1) × (X_j − X_new_1),  j = 1, 2, ..., N
where X_new_2 is the newly synthesized minority-class sample; X_new_1 is the sample used to synthesize it; rand(0,1) is a random number between 0 and 1; X_j is a randomly selected sample among the K nearest neighbours of X_new_1 within the newly generated minority sample points; and N is the total number of newly generated minority sample points.
The newly synthesized sample points remaining after denoising are merged with X_new, and the merged result is stored in the "final synthesized sample set";
step 10) train a random forest model on the data in the "final synthesized sample set" to obtain the fan icing prediction model; the operating data of fan No. 2 are used as test data to verify the effectiveness of the wind turbine blade icing prediction method based on the unbalanced data set; in this embodiment, AUC is used as the evaluation standard for the fan blade icing prediction effect; AUC is a quantitative measure of classifier quality, generally between 0.5 and 1, with a higher value indicating a better classifier; an AUC of 0.5 is equivalent to completely random classification.

Claims (3)

1. A method for predicting icing of a fan blade based on an unbalanced data set, characterized by comprising the following steps:
step 1) collecting and sorting historical meteorological data and fan operating state data of a wind farm, and storing the sorted data in a database for convenient use in prediction; the historical meteorological data of the wind farm, the fan operating state data and the prediction target form a fan historical-data training vector, which specifically comprises the following dimensions: wind speed, generator rotational speed, wind power, ambient temperature, generator internal temperature, and whether the fan blade shows icing, expressed as:
X = [v_w, v_g, p, t_e, t_i, f]
where v_w represents the wind speed; v_g represents the generator rotational speed; p represents the wind power; t_e represents the ambient temperature; t_i represents the internal temperature of the generator; and f indicates whether the fan blade is iced;
step 2) applying "range method" (min-max) normalization to the fan historical-data samples composed of the collected and sorted training vectors, so that the processed data are better suited to training the learning model; the normalization formula being:
X_new = (X − X_min) / (X_max − X_min)
where X represents a fan historical-data sample; X_min and X_max represent the minimum and maximum values in the fan historical-data samples; and X_new is the processed fan historical-data sample;
step 3) determining the number of BIRCH clusters according to the data distribution of the minority-class samples in the fan historical-data samples and the Calinski-Harabasz Index score under different cluster numbers, the minority-class samples being the data samples corresponding to fan icing in the fan historical-data samples; first assuming a set of candidate BIRCH cluster numbers, and then, after the Calinski-Harabasz Index scores are computed, selecting the cluster number z with the highest score;
step 4) applying the BIRCH clustering algorithm to the range-normalized data samples X_new to obtain z clusters, and storing the clustering result in the data set D = {cluster_1, cluster_2, cluster_3, cluster_4, cluster_5, ..., cluster_z};
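Steps 3)–4) can be sketched with scikit-learn's Birch and calinski_harabasz_score (an assumption on our part — the patent does not name a library; the candidate range and the stand-in data are illustrative):

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Stand-in for the normalized minority-class samples X_new.
X_min, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Score an assumed set of candidate cluster numbers and keep the best z.
best_z, best_score = None, -np.inf
for z in range(2, 8):
    labels = Birch(n_clusters=z).fit_predict(X_min)
    score = calinski_harabasz_score(X_min, labels)
    if score > best_score:
        best_z, best_score = z, score

# Re-cluster with the winning z and store the result as D = {cluster_1, ...}.
labels = Birch(n_clusters=best_z).fit_predict(X_min)
D = {f"cluster_{c + 1}": X_min[labels == c] for c in range(best_z)}
print(best_z, sorted(D))
```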
step 5) calculating the density value of the minority-class sample points in each cluster of the set D according to the density formula; the sample density value of a minority-class sample X_origin is defined as the sum of its distances to the K nearest same-class neighbour samples around it, described by the formula:
Density = Σ_{i=1}^{K} d_i
wherein d_i represents the Euclidean distance between the two sample points, and i indexes one of the K nearest same-class neighbour samples around X_origin;
step 6) sorting the minority-class sample points in all clusters in descending order of their density; according to the sorting result, dividing the sample points in each cluster equally into three concentration areas: high, medium and low;
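Steps 5)–6) can be sketched as follows (the helper names, K = 5 and the random toy cluster are our assumptions; note that a small distance sum corresponds to a dense, i.e. high-concentration, neighbourhood):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def density(cluster_samples, K=5):
    """Sum of Euclidean distances from each sample to its K nearest
    same-class neighbours (the patent's Density = sum_{i=1..K} d_i)."""
    nn = NearestNeighbors(n_neighbors=K + 1).fit(cluster_samples)
    dist, _ = nn.kneighbors(cluster_samples)   # column 0 is the zero self-distance
    return dist[:, 1:].sum(axis=1)

def split_by_concentration(cluster_samples, K=5):
    """Split one cluster's samples into high / medium / low concentration
    thirds; a smaller distance sum means a denser neighbourhood."""
    order = np.argsort(density(cluster_samples, K))
    thirds = np.array_split(order, 3)
    return {name: cluster_samples[idx]
            for name, idx in zip(("high", "medium", "low"), thirds)}

rng = np.random.default_rng(0)
areas = split_by_concentration(rng.normal(size=(30, 2)))
print({name: len(pts) for name, pts in areas.items()})
```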
step 7) in each concentration area of each cluster, finding the K nearest minority-class neighbours of each minority-class sample point; selecting high_num minority-class samples from the K nearest neighbours of each sample point in the high-concentration area, middle_num minority-class samples from the K nearest neighbours of each sample point in the medium-concentration area, and low_num minority-class samples from the K nearest neighbours of each sample point in the low-concentration area;
step 8) synthesizing new samples from the high_num, middle_num and low_num minority-class samples selected in step 7) according to the following formula; after synthesis, each minority-class sample in the high-concentration area generates high_num new sample points, and each minority-class sample in the medium- and low-concentration areas generates middle_num and low_num new sample points, respectively;
X_new_1,i = X_origin,i + rand(0,1) × (X_neighbor,i − X_origin,i), i = 1, 2, ..., n
wherein X_new_1 is the newly generated sample point; X_origin,i is the i-th dimension feature of a minority-class sample selected from one of the concentration areas in step 7); X_neighbor,i is the i-th dimension feature of a neighbouring sample point among the minority-class samples selected in step 7); i = 1, 2, ..., n, with n the total number of sample dimensions; rand(0,1) represents a random number between 0 and 1;
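The interpolation of steps 7)–8) can be sketched as below (a minimal version; num_to_make stands in for high_num / middle_num / low_num, and the per-dimension random gap follows the formula above, whereas classic SMOTE draws one scalar per synthetic point):

```python
import numpy as np

def synthesize(x_origin, neighbors, num_to_make, rng=None):
    """Generate num_to_make synthetic points from one minority-class sample:
    X_new[i] = X_origin[i] + rand(0,1) * (X_neighbor[i] - X_origin[i])."""
    if rng is None:
        rng = np.random.default_rng()
    new_points = []
    for _ in range(num_to_make):
        x_nb = neighbors[rng.integers(len(neighbors))]  # pick one of the K neighbours
        gap = rng.random(x_origin.shape)                # rand(0,1) for each dimension i
        new_points.append(x_origin + gap * (x_nb - x_origin))
    return np.array(new_points)

x = np.array([0.0, 0.0])
nbs = np.array([[1.0, 1.0], [2.0, 0.0]])
pts = synthesize(x, nbs, num_to_make=3, rng=np.random.default_rng(1))
print(pts.shape)
```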
step 9) interpolation over the minority-class samples may introduce noise sample points, so the synthesized data set must be denoised; whether a newly generated sample point is noise is judged from the classes of its neighbouring sample points; all newly generated minority-class sample points are scanned and the noise points deleted; the remaining synthetic sample points after denoising are then merged with the fan historical-data samples X_new processed in step 2), and the merged result is stored as the "final synthetic sample set"; the noise-point identification procedure is as follows:
a) finding the 20 nearest neighbour samples of a newly generated sample point X_new_1 among the fan historical-data samples; let m' be the number of those 20 nearest neighbours that belong to the majority class;
b) if m' = 20, the newly generated sample point X_new_1 is judged to be a noise point;
c) if 0 ≤ m' ≤ 10, the new sample point X_new_1 is judged to be a safe new sample point, not a noise point, and no operation is performed;
d) if 10 < m' < 20, the new sample point X_new_1 is judged to be a dangerous point and is added to the DANGER set;
finally, for each sample point X_new_1 in the DANGER set, a new sample is generated with the SMOTE algorithm and all noise points are removed; the SMOTE formula is expressed as:
X_new_2 = X_new_1 + rand(0,1) × (X_j − X_new_1), j = 1, 2, ..., N
in the formula, X_new_2 represents the newly synthesized minority-class sample; X_new_1 represents the original minority-class sample used to synthesize the new sample; rand(0,1) represents a random number between 0 and 1; X_j represents a sample randomly selected from the K neighbours of X_new_1 among the new minority-class sample points; N is the total number of new minority-class sample points;
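The noise / safe / DANGER rule of step 9) can be sketched as a Borderline-SMOTE-style check (the helper names and toy data are ours; the thresholds follow a)–d) above, with m' counting majority-class neighbours):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def label_synthetic(new_pts, hist_X, hist_is_minority, m=20):
    """For each synthetic point, count majority-class samples among its m
    nearest historical neighbours: m' == m -> noise, m' <= m/2 -> safe,
    otherwise -> danger (candidate for the DANGER set)."""
    nn = NearestNeighbors(n_neighbors=m).fit(hist_X)
    _, idx = nn.kneighbors(new_pts)
    m_prime = (~hist_is_minority[idx]).sum(axis=1)
    return np.where(m_prime == m, "noise",
                    np.where(m_prime <= m // 2, "safe", "danger"))

rng = np.random.default_rng(0)
hist_X = np.vstack([rng.normal(0, 1, (50, 2)),     # majority class near the origin
                    rng.normal(10, 1, (10, 2))])   # minority class near (10, 10)
is_min = np.array([False] * 50 + [True] * 10)
new_pts = np.array([[0.0, 0.0],     # deep inside the majority region
                    [10.0, 10.0]])  # inside the minority region
print(label_synthetic(new_pts, hist_X, is_min))
```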
step 10) training a random forest model on the data in the "final synthetic sample set" to obtain the fan-icing prediction model; the operating data of fan No. 2 are used as test data to verify the effectiveness of the prediction method, with AUC as the evaluation criterion for the fan-blade-icing prediction; AUC is a quantitative measure of classifier quality with values between 0.5 and 1: a higher AUC indicates a better classifier, and an AUC of 0.5 is equivalent to completely random classification.
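Step 10) can be sketched with scikit-learn's RandomForestClassifier and AUC scoring (the synthetic classification data and train/test split stand in for the "final synthetic sample set" and the No. 2 fan's data, which are not available here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in training data: 6 features (like [vw, vg, p, te, ti] plus noise),
# imbalanced binary labels as in the icing problem.
X, y = make_classification(n_samples=1000, n_features=6,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])  # 0.5 = random, 1.0 = perfect
print(round(auc, 3))
```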
2. The method of claim 1, wherein the Calinski-Harabasz Index score in step 3) is calculated as:
s(k) = tr(B_k) / tr(W_k) × (m − k) / (k − 1)
wherein B_k represents the between-cluster covariance matrix; W_k represents the within-cluster covariance matrix; tr(·) denotes the trace of a matrix; m represents the number of minority-class samples in the fan historical-data samples; k represents the number of clusters; the cluster number z selected in step 3) is the k with the highest score s(k).
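The score formula of claim 2 can be checked numerically against scikit-learn's implementation (the toy two-blob data and the KMeans clustering are illustrative only):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(6, 1, (40, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

m, k = len(X), 2
mu = X.mean(axis=0)
# tr(B_k): between-cluster scatter, each cluster mean weighted by its size.
tr_B = sum(len(X[labels == c]) * np.sum((X[labels == c].mean(axis=0) - mu) ** 2)
           for c in range(k))
# tr(W_k): within-cluster scatter around each cluster mean.
tr_W = sum(np.sum((X[labels == c] - X[labels == c].mean(axis=0)) ** 2)
           for c in range(k))
s = tr_B / tr_W * (m - k) / (k - 1)   # the claim-2 formula
print(np.isclose(s, calinski_harabasz_score(X, labels)))
```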
3. The method of claim 1, wherein the neighbour-selection criterion in step 7) is measured with the Euclidean distance formula, i.e. the K minority-class samples with the smallest Euclidean distance around a minority-class sample point are selected; the Euclidean distance between samples x_j and x_i is:

d(x_j, x_i) = sqrt( Σ_{t=1}^{n} (x_j^(t) − x_i^(t))² )

wherein x_j^(t) and x_i^(t) respectively represent the values of samples x_j and x_i in the t-th dimension; a sample x_j is expressed as x_j = (x_j^(1), x_j^(2), ..., x_j^(n)), i.e. x_j has n dimensions.
CN201910207037.4A 2019-03-19 2019-03-19 Fan blade icing prediction method based on unbalanced data set Active CN109978039B (en)

Publications (2)

Publication Number Publication Date
CN109978039A CN109978039A (en) 2019-07-05
CN109978039B true CN109978039B (en) 2020-10-16
