CN109978039B - Fan blade icing prediction method based on unbalanced data set - Google Patents


Info

Publication number
CN109978039B
CN109978039B (application CN201910207037.4A)
Authority
CN
China
Prior art keywords
sample
samples
new
minority
fan
Prior art date
Legal status
Active
Application number
CN201910207037.4A
Other languages
Chinese (zh)
Other versions
CN109978039A (en
Inventor
岳东
葛阳鸣
卜阳
宋星星
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN201910207037.4A
Publication of CN109978039A
Application granted
Publication of CN109978039B
Legal status: Active

Classifications

    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06F ELECTRIC DIGITAL DATA PROCESSING › G06F18/00 Pattern recognition › G06F18/20 Analysing
        • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
        • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
        • G06F18/2148 characterised by the process organisation or structure, e.g. boosting cascade
        • G06F18/23 Clustering techniques
        • G06F18/231 Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
        • G06F18/24 Classification techniques
        • G06F18/241 relating to the classification model, e.g. parametric or non-parametric approaches
        • G06F18/2413 based on distances to training or reference patterns
        • G06F18/24147 Distances to closest patterns, e.g. nearest neighbour classification
        • G06F18/243 relating to the number of classes
        • G06F18/24323 Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Structures Of Non-Positive Displacement Pumps (AREA)

Abstract

The invention discloses a method for predicting icing of a fan blade under unbalanced data set conditions, which balances the distribution of data samples in an unbalanced data set and predicts fan blade icing events in combination with a random forest algorithm (RF). The algorithm first applies BIRCH hierarchical clustering to the original minority-class samples, and divides each cluster into different concentration areas according to the density of the sample points. Lower-concentration areas synthesize more samples, whereas higher-concentration areas require fewer synthesized samples. To follow the original distribution of the minority-class samples, new samples are synthesized separately in the different concentration zones within each cluster. Second, the BIRCH-SMOTE algorithm improves the linear interpolation operation by increasing the randomness of the interpolation process, which effectively avoids overlapping and redundant synthesized samples. Finally, the random forest model is trained on the balanced data set to obtain the fan blade icing prediction result.

Description

Fan blade icing prediction method based on unbalanced data set
Technical Field
The invention relates to the technical field of short-term wind power generation, in particular to a method for predicting blade icing of a wind driven generator based on an unbalanced data set.
Background
Wind energy is a typical renewable clean energy source and has received much attention worldwide due to its abundance and suitability for large-scale exploitation. Wind power holds an overwhelming advantage in the world's installed renewable-energy generation capacity. Among renewable sources, wind energy accounts for more than half of the available renewable energy, and wind power generation is the most mature of all renewable-resource technologies. In recent years, world wind power generation has grown rapidly, and its prospects are bright. By December 2012, the world's installed wind power capacity had increased from 60 GW in 2000 to 282.578 GW. At present, China has the largest installed wind power capacity and the fastest growth rate in the world. By 2015, China's total installed wind power capacity reached 145.1 GW, accounting for about 2.5% of the country's total installed capacity, with an annual growth rate of 26.6%.
While developing rapidly, wind energy also faces outstanding problems. As wind turbines grow taller, their blades freeze easily in severely cold environments. Blade icing is a global problem in the field of wind power generation. Blade icing, together with the changes in material and structural properties and in loading caused by low-temperature environments, poses a great threat to the generating performance and safe operation of the fan. As the design power of fans continues to rise, fan heights also keep increasing, so that in winter a large number of fans reach low cloud layers and freeze very easily in cold, humid environments. At present, real-time fan operating data are mainly stored by a SCADA system, and the monitoring approach for blade icing faults is mainly to compare the deviation between the fan's actual power and theoretical power, triggering an alarm and shutdown when the deviation reaches a certain value. However, by the time the alarm is triggered, the blade is often already iced over a large area, and continued operation increases the risk of blade breakage and damage. Although many new fans are designed with automatic de-icing systems, the practical challenge is that the early stage of icing is hard to predict accurately enough to turn on the de-icing system as early as possible. Therefore, the prediction accuracy of the icing process determines whether the fan can operate normally and safely.
Disclosure of Invention
The purpose of the invention is as follows: the invention provides a wind turbine blade icing prediction method based on an unbalanced data set, aiming to solve the problem of insufficient accuracy in predicting fan blade icing under unbalanced data conditions.
The technical scheme is as follows: the invention provides an improved SMOTE algorithm (BIRCH-SMOTE), which balances the distribution of data samples in an unbalanced data set and predicts fan blade icing events in combination with a random forest algorithm (RF). The algorithm first applies BIRCH hierarchical clustering to the original minority-class samples, and divides each cluster into different concentration areas according to the density of the sample points. Lower-concentration areas synthesize more samples, whereas higher-concentration areas require fewer synthesized samples. To follow the original distribution of the minority-class samples, new samples are synthesized separately in the different concentration zones within each cluster. Second, the BIRCH-SMOTE algorithm improves the linear interpolation operation by increasing the randomness of the interpolation process, which effectively avoids overlapping and redundant synthesized samples. Finally, the random forest model is trained on the balanced data set to obtain the fan blade icing prediction result.
Specifically, the method for predicting the icing of the fan under the unbalanced data set comprises the following steps:
step 1) collect and sort the historical meteorological data and fan operating state data of a wind farm, and store the sorted data in a database for convenient use in prediction; the historical meteorological data of the wind farm, the fan operating state data and the prediction target together form a fan historical-data training vector; the training vector specifically comprises the following dimensions: wind speed, generator rotational speed, wind power, ambient temperature, generator internal temperature, and whether the fan blade shows icing, and can be expressed as:
X = [v_w, v_g, p, t_e, t_i, f]
where v_w represents the wind speed; v_g represents the generator rotational speed; p represents the wind power; t_e represents the ambient temperature; t_i represents the internal temperature of the generator; and f indicates whether the fan blade is iced;
step 2) apply "range method" (min-max) normalization to the fan historical-data samples composed of the collected and sorted training vectors, so that the processed data are better suited to training the learning model; the normalization formula is:
X_new = (X − X_min) / (X_max − X_min)
where X represents a fan historical-data sample; X_min and X_max represent the minimum and maximum values in the fan historical-data samples; and X_new is the processed fan historical-data sample;
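The range-method normalization of step 2) can be sketched as follows. This is an illustrative Python fragment, not part of the patent; the toy five-dimensional rows merely stand in for the fan training vectors.

```python
import numpy as np

def range_normalize(X):
    """Min-max ("range method") normalization: scale each feature
    column of X into [0, 1] using that column's min and max."""
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

# Toy sample: rows are training vectors [v_w, v_g, p, t_e, t_i]
X = np.array([[4.0, 10.0, 200.0, -5.0, 15.0],
              [8.0, 14.0, 600.0,  0.0, 25.0],
              [6.0, 12.0, 400.0, -2.5, 20.0]])
X_new = range_normalize(X)
```

After normalization every column spans exactly [0, 1], so features with very different physical ranges (wind speed vs. power) contribute comparably during model training.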
step 3) determine the number of BIRCH clusters according to the data distribution of the minority-class samples in the fan historical-data samples and the Calinski-Harabasz Index score under different candidate cluster numbers; the minority-class samples are the data samples corresponding to fan icing in the fan historical-data samples; first a set of candidate BIRCH cluster numbers is assumed, and after the Calinski-Harabasz Index scores are computed, the cluster number z with the highest score is selected; the Calinski-Harabasz Index for k clusters is calculated as:
s(k) = [tr(B_k) / tr(W_k)] × [(m − k) / (k − 1)]
where B_k is the between-cluster covariance matrix; W_k is the within-cluster covariance matrix; tr(·) denotes the trace of a matrix; m is the number of minority-class samples in the fan historical-data samples; k is the candidate number of clusters; and z is the cluster number with the highest score;
step 4) cluster the minority-class samples in the range-normalized fan historical data X_new into z clusters using the BIRCH clustering algorithm, and store the clustering results in a data set D = {cluster_1, cluster_2, cluster_3, cluster_4, cluster_5, ..., cluster_z};
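Steps 3) and 4) — choosing z by the Calinski-Harabasz score and then clustering with BIRCH — can be sketched with scikit-learn. The candidate cluster counts, the BIRCH threshold, and the three-blob toy data are illustrative assumptions, not values from the patent.

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.metrics import calinski_harabasz_score

def select_birch_clusters(X_min, candidates=(3, 4, 5, 6, 7, 8)):
    """Fit BIRCH for each candidate cluster number and return the
    count z with the highest Calinski-Harabasz score, plus labels."""
    best_z, best_score, best_labels = None, -np.inf, None
    for z in candidates:
        labels = Birch(n_clusters=z, threshold=0.1).fit_predict(X_min)
        score = calinski_harabasz_score(X_min, labels)
        if score > best_score:
            best_z, best_score, best_labels = z, score, labels
    return best_z, best_labels

# Toy minority-class data: three well-separated blobs in 2-D
rng = np.random.default_rng(0)
X_min = np.vstack([rng.normal(c, 0.1, size=(30, 2)) for c in (0.0, 5.0, 10.0)])
z, labels = select_birch_clusters(X_min)
clusters = {f"cluster_{k + 1}": X_min[labels == k] for k in range(z)}
```

On such clearly separated data the score peaks at the natural cluster count, and the resulting dictionary plays the role of the data set D = {cluster_1, ..., cluster_z}.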
step 5) calculate the density value of the minority-class sample points in each cluster of the set D according to a density formula; the sample density value of a minority-class sample X_origin is defined as the sum of its distances to the surrounding K nearest same-class neighbours, and Density can be described as:
Density(X_origin) = Σ_{i=1}^{K} d_i
where d_i represents the Euclidean distance between the two sample points, and i indexes one of the K nearest same-class neighbours around X_origin;
step 6) sort the minority-class sample points in every cluster in descending order of their density values; according to the sorting result, divide the sample points within each cluster equally into three concentration areas: high, medium and low concentration;
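Steps 5) and 6) can be sketched as below. Note that the patent defines the density value as a distance sum; this sketch assumes a smaller sum means a denser (higher-concentration) point, which is one reading of the text, and the zone names and function names are the editor's, not the patent's.

```python
import numpy as np

def density_values(X_cluster, K=5):
    """Density value of each minority sample: the sum of Euclidean
    distances to its K nearest same-class neighbours in the cluster."""
    diff = X_cluster[:, None, :] - X_cluster[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))      # pairwise distance matrix
    np.fill_diagonal(dist, np.inf)           # a point is not its own neighbour
    return np.sort(dist, axis=1)[:, :K].sum(axis=1)

def split_concentration_zones(X_cluster, K=5):
    """Sort samples by density and split them equally into high-,
    medium- and low-concentration zones (smaller distance sum = denser)."""
    order = np.argsort(density_values(X_cluster, K))
    thirds = np.array_split(order, 3)
    return {zone: X_cluster[idx]
            for zone, idx in zip(("high", "medium", "low"), thirds)}

# Toy cluster: 9 tightly packed points plus 9 scattered points
rng = np.random.default_rng(1)
X_c = np.vstack([rng.normal(0, 0.05, (9, 2)), rng.normal(3, 1.0, (9, 2))])
zones = split_concentration_zones(X_c, K=3)
```

The tightly packed points end up in the high-concentration zone, which is the zone that will later receive the fewest synthesized samples.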
step 7) in each concentration area of each cluster, find the K nearest-neighbour minority-class samples of each minority sample point; the nearest neighbours are selected by the Euclidean distance formula, i.e. the K surrounding samples with the smallest Euclidean distance to the sample point are chosen; the Euclidean distance between sample x_j and sample x_i is:
d(x_j, x_i) = sqrt( Σ_{t=1}^{n} (x_j,t − x_i,t)² )
where x_j,t and x_i,t represent the values of samples x_j and x_i in the t-th dimension, and sample x_j can be expressed as x_j = (x_j,1, x_j,2, ..., x_j,n), i.e. x_j has n dimensions in total; for the high-concentration area, select high_num minority samples from the K nearest-neighbour minority samples found for each sample point; for the medium-concentration area, select middle_num of the K nearest neighbours; for the low-concentration area, select low_num of the K nearest neighbours;
step 8) use the high_num, middle_num and low_num minority-class samples selected in step 7) to synthesize new samples according to the following formula; after synthesis, every minority sample in the high-concentration area generates high_num new sample points, and every minority sample in the medium- and low-concentration areas obtains middle_num and low_num new sample points respectively;
X_new_1,i = X_origin,i + rand(0,1) × (X_neighbor,i − X_origin,i),  i = 1, 2, ..., n
where X_new_1 is the newly generated sample point; X_origin,i is the i-th dimension of a minority-class sample selected in one of the concentration areas in step 7); X_neighbor,i is the i-th dimension of one of the neighbouring minority samples selected for X_origin in step 7); n is the total number of sample dimensions; and rand(0,1) is a random number between 0 and 1, drawn for each dimension i;
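The improved interpolation of step 8) can be sketched as follows. The per-dimension random factor reflects the "increased randomness" the text describes; the function signature, K, and the toy data are illustrative assumptions.

```python
import numpy as np

def synthesize(X_zone, num_per_sample, K=5, rng=None):
    """For each minority sample in a concentration zone, pick
    num_per_sample of its K nearest minority neighbours and
    interpolate one new sample per neighbour, drawing an
    independent rand(0,1) for every dimension."""
    if rng is None:
        rng = np.random.default_rng()
    diff = X_zone[:, None, :] - X_zone[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)
    new_samples = []
    for i, x in enumerate(X_zone):
        neighbours = np.argsort(dist[i])[:K]
        for j in rng.choice(neighbours, size=num_per_sample, replace=False):
            gap = rng.random(x.shape[0])     # one random factor per dimension
            new_samples.append(x + gap * (X_zone[j] - x))
    return np.array(new_samples)

# 10 minority samples with 5 features; e.g. high_num = 3 new points per sample
rng = np.random.default_rng(2)
X_zone = rng.normal(0.5, 0.1, (10, 5))
X_syn = synthesize(X_zone, num_per_sample=3, K=5, rng=rng)
```

Because each dimension gets its own interpolation factor, the new point lies inside the axis-aligned box spanned by the pair rather than on the straight segment between them, which is what lets the method avoid the overlapping samples that classic single-factor SMOTE can produce.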
step 9) interpolation over minority-class samples may introduce noise sample points, so the synthesized data set must be denoised; whether a newly generated sample point is noise is judged by examining the class of its neighbouring sample points; all newly generated minority sample points are scanned and the noise points are deleted; the noise identification process is as follows:
a) for a newly generated sample point X_new_1, find its 20 nearest-neighbour samples in the fan historical-data samples; let m' be the number of majority-class samples among these 20 nearest neighbours;
b) if m' = 20, judge the newly generated sample point X_new_1 to be a noise point;
c) if 0 ≤ m' ≤ 10, judge the new sample point X_new_1 to be a safe new sample point, not a noise point, and perform no operation;
d) if 10 < m' < 20, judge the new sample point X_new_1 to be a danger point; some new minority sample points need to be generated near this point, and it is added to the DANGER set;
finally, for each sample point X_new_1 in the DANGER set, generate a new sample using the SMOTE algorithm, and remove all noise points; the SMOTE formula can be expressed as:
X_new_2 = X_new_1 + rand(0,1) × (X_j − X_new_1),  j = 1, 2, ..., N
where X_new_2 is the newly synthesized minority-class sample; X_new_1 is the sample used to synthesize it; rand(0,1) is a random number between 0 and 1; X_j is a randomly selected sample among the K nearest neighbours of X_new_1 within the newly generated minority sample points; and N is the total number of newly generated minority sample points.
The newly synthesized sample points remaining after denoising are merged with X_new, and the merged result is stored in the "final synthesized sample set";
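The neighbour-counting screen of step 9) can be sketched as follows. Here m' counts majority-class neighbours — the reading under which the three cases (noise / safe / danger) are mutually consistent, as in Borderline-SMOTE; the data layout and function names are illustrative.

```python
import numpy as np

def screen_new_samples(X_new, X_all, y_all, n_neighbors=20):
    """Classify each synthesized minority point as 'noise', 'safe'
    or 'danger' by counting majority-class points (y == 0) among its
    n_neighbors nearest neighbours in the full data set."""
    labels = []
    for x in X_new:
        dist = np.linalg.norm(X_all - x, axis=1)
        nn = np.argsort(dist)[:n_neighbors]
        m = int((y_all[nn] == 0).sum())      # majority-class neighbour count
        if m == n_neighbors:
            labels.append("noise")           # delete this point
        elif m <= n_neighbors // 2:
            labels.append("safe")            # keep as-is
        else:
            labels.append("danger")          # re-SMOTE near this point
    return labels

# Minority cluster near the origin (y = 1), majority cluster near 5 (y = 0)
rng = np.random.default_rng(3)
X_all = np.vstack([rng.normal(0, 0.2, (25, 2)), rng.normal(5, 0.2, (25, 2))])
y_all = np.array([1] * 25 + [0] * 25)
labels = screen_new_samples(np.array([[0.0, 0.0], [5.0, 5.0]]), X_all, y_all)
```

A synthetic point landing inside the minority cluster is kept as safe, while one landing deep in the majority region is flagged as noise and dropped.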
step 10) train a random forest model on the data in the "final synthesized sample set" to obtain the fan icing prediction model; the operating data of fan No. 2 are used as test data to verify the effectiveness of the wind turbine blade icing prediction method based on the unbalanced data set; in the embodiment, AUC is used as the evaluation standard for the fan blade icing prediction effect; AUC is a quantitative measure of classifier quality, generally between 0.5 and 1, with a higher value indicating a better classifier; an AUC of 0.5 is equivalent to completely random classification.
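Step 10) — training the random forest and scoring with AUC — can be sketched with scikit-learn. The separable synthetic data below merely stand in for the balanced "final synthesized sample set"; all sizes and parameters are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in for the balanced "final synthesized sample set":
# two separable classes in a 5-dimensional normalized feature space
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.3, 0.1, (200, 5)),   # no-icing samples
               rng.normal(0.7, 0.1, (200, 5))])  # icing samples
y = np.array([0] * 200 + [1] * 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
```

`roc_auc_score` takes the predicted probability of the positive (icing) class, not the hard labels, so `predict_proba(...)[:, 1]` is used rather than `predict`.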
Has the advantages that: the method clusters the original minority-class data samples using the BIRCH clustering algorithm and divides each cluster into different concentration areas according to the density of the sample points. Lower-concentration areas synthesize more samples, whereas higher-concentration areas require fewer synthesized samples. To follow the original distribution of the minority-class samples, new samples are synthesized separately in the different concentration zones within each cluster. Second, the BIRCH-SMOTE algorithm improves the linear interpolation operation by increasing the randomness of the interpolation process, which effectively avoids overlapping and redundant synthesized samples. After the new minority samples are synthesized, the invention denoises the newly synthesized sample points to avoid the noise introduced by the synthesis operation. Finally, the random forest model is trained on the balanced data set to obtain the fan blade icing prediction result. Compared with the standard SMOTE algorithm and the improvements in other literature, BIRCH-SMOTE significantly improves the accuracy of fan blade icing prediction on unbalanced data sets.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a graph comparing the operation effects of the embodiment of the present invention (BIRCH-SMOTE algorithm).
Detailed Description
As shown in FIG. 1, the method for predicting the icing condition of the blade of the wind driven generator based on the unbalanced data set comprises the following steps:
step 1) collect and sort the historical meteorological data and fan operating state data of a wind farm, and store the sorted data in a database for convenient use in prediction; the historical meteorological data, the fan operating state data and the prediction target (whether the fan is iced) together form a fan historical-data training vector; the training vector specifically comprises the following dimensions: wind speed, generator rotational speed, wind power, ambient temperature, generator internal temperature, and whether the fan blade shows icing, and can be expressed as:
X = [v_w, v_g, p, t_e, t_i, f]
where v_w represents the wind speed; v_g represents the generator rotational speed; p represents the wind power; t_e represents the ambient temperature; t_i represents the internal temperature of the generator; and f indicates whether the fan blade is iced; in this embodiment, real-time operating data of 2 wind turbines in a wind farm, collected from August to December, were collated; the specific distribution of the data is as follows:
[Table of the collected data distribution, rendered as an image in the original document]
step 2) apply "range method" (min-max) normalization to the fan historical-data samples composed of the collected and sorted training vectors, so that the processed data are better suited to training the learning model; the normalization formula is:
X_new = (X − X_min) / (X_max − X_min)
where X represents a fan historical-data sample; X_min and X_max represent the minimum and maximum values in the fan historical-data samples; and X_new is the processed fan historical-data sample;
step 3) determine the number of BIRCH clusters according to the data distribution of the minority-class samples in the fan historical-data samples and the Calinski-Harabasz Index score under different candidate cluster numbers; the minority-class samples are the data samples corresponding to fan icing in the fan historical-data samples; first, the candidate BIRCH cluster numbers are assumed to be 3, 4, 5, 6, 7 and 8 clusters; then, after the Calinski-Harabasz Index scores are computed, the cluster number z with the highest score is selected; the Calinski-Harabasz Index for k clusters is calculated as:
s(k) = [tr(B_k) / tr(W_k)] × [(m − k) / (k − 1)]
where B_k is the between-cluster covariance matrix; W_k is the within-cluster covariance matrix; tr(·) denotes the trace of a matrix; m is the number of minority-class samples in the fan historical-data samples; k is the candidate number of clusters; and z is the cluster number with the highest score;
finally, after the Calinski-Harabasz Index scores are computed, 6 clusters obtain the highest score, i.e. z = 6.
Step 4) cluster the minority-class samples in the range-normalized data X_new into 6 clusters using the BIRCH clustering algorithm, and store the clustering results in a data set D = {cluster_1, cluster_2, cluster_3, cluster_4, cluster_5, cluster_6};
step 5) calculate the density value of the minority-class sample points in each cluster of the set D according to a density formula; the sample density value of a minority-class sample X_origin is defined as the sum of its distances to the surrounding K nearest same-class neighbours, and Density can be described as:
Density(X_origin) = Σ_{i=1}^{K} d_i
where d_i represents the Euclidean distance between the two sample points, and i indexes one of the K nearest same-class neighbours around X_origin; experimental exploration showed that the K value suited to the fan data in this embodiment is 15, i.e. the 15 nearest samples around each minority-class sample are selected as the basis for calculating the density value.
Step 6), arranging the minority sample points in all the clustering clusters in a descending order according to the density of the minority sample points; according to the sequencing result, dividing the sample points in the clusters into three concentration areas of high concentration, medium concentration and low concentration equally;
step 7) in each concentration area of each cluster, find the 15 nearest-neighbour minority-class samples of each minority sample point; the nearest neighbours are selected by the Euclidean distance formula, i.e. the 15 samples with the smallest Euclidean distance to the sample point are chosen; the Euclidean distance between sample x_j and sample x_i is:
d(x_j, x_i) = sqrt( Σ_{t=1}^{n} (x_j,t − x_i,t)² )
where x_j,t and x_i,t represent the values of samples x_j and x_i in the t-th dimension, and sample x_j can be expressed as x_j = (x_j,1, x_j,2, ..., x_j,n), i.e. x_j has n dimensions in total; for the high-concentration area, select high_num = 7 minority samples from the 15 nearest-neighbour minority samples found for each sample point; for the medium-concentration area, select middle_num = 11 of the 15 nearest neighbours; for the low-concentration area, select low_num = 13 of the 15 nearest neighbours;
step 8) use the 7, 11 and 13 minority-class samples selected in step 7) to synthesize new samples according to the following formula; after synthesis, every minority sample in the high-concentration area generates high_num = 7 new sample points, and every minority sample in the medium- and low-concentration areas obtains middle_num = 11 and low_num = 13 new sample points respectively;
X_new_1,i = X_origin,i + rand(0,1) × (X_neighbor,i − X_origin,i),  i = 1, 2, ..., n
where X_new_1 is the newly generated sample point; X_origin,i is the i-th dimension of a minority-class sample selected in one of the concentration areas in step 7); X_neighbor,i is the i-th dimension of one of the neighbouring minority samples selected for X_origin in step 7); n is the total number of sample dimensions; and rand(0,1) is a random number between 0 and 1, drawn for each dimension i;
step 9) interpolation over minority-class samples may introduce noise sample points, so the synthesized data set must be denoised; whether a newly generated sample point is noise is judged by examining the class of its neighbouring sample points; all newly generated minority sample points are scanned and the noise points are deleted; the noise identification process is as follows:
a) for a newly generated sample point X_new_1, find its 20 nearest-neighbour samples in the fan historical-data samples; let m' be the number of majority-class samples among these 20 nearest neighbours;
b) if m' = 20, judge the newly generated sample point X_new_1 to be a noise point;
c) if 0 ≤ m' ≤ 10, judge the new sample point X_new_1 to be a safe new sample point, not a noise point, and perform no operation;
d) if 10 < m' < 20, judge the new sample point X_new_1 to be a danger point; some new minority sample points need to be generated near this point, and it is added to the DANGER set;
finally, for each sample point X_new_1 in the DANGER set, generate a new sample using the SMOTE algorithm, and remove all noise points; the SMOTE formula can be expressed as:
X_new_2 = X_new_1 + rand(0,1) × (X_j − X_new_1),  j = 1, 2, ..., N
where X_new_2 is the newly synthesized minority-class sample; X_new_1 is the sample used to synthesize it; rand(0,1) is a random number between 0 and 1; X_j is a randomly selected sample among the K nearest neighbours of X_new_1 within the newly generated minority sample points; and N is the total number of newly generated minority sample points.
The newly synthesized sample points remaining after denoising are merged with X_new, and the merged result is stored in the "final synthesized sample set";
step 10) train a random forest model on the data in the "final synthesized sample set" to obtain the fan icing prediction model; the operating data of fan No. 2 are used as test data to verify the effectiveness of the wind turbine blade icing prediction method based on the unbalanced data set; in this embodiment, AUC is used as the evaluation standard for the fan blade icing prediction effect; AUC is a quantitative measure of classifier quality, generally between 0.5 and 1, with a higher value indicating a better classifier; an AUC of 0.5 is equivalent to completely random classification.

Claims (3)

1. A method for predicting icing of a fan blade based on an unbalanced data set, characterized by comprising the following steps:
step 1) collecting and sorting historical meteorological data and fan operating state data of a wind farm, and storing the sorted data in a database for convenient use in prediction; the historical meteorological data of the wind farm, the fan operating state data and the prediction target form a fan historical-data training vector, which specifically comprises the following dimensions: wind speed, generator rotational speed, wind power, ambient temperature, generator internal temperature, and whether the fan blade shows icing, expressed as:
X = [v_w, v_g, p, t_e, t_i, f]
where v_w represents the wind speed; v_g represents the generator rotational speed; p represents the wind power; t_e represents the ambient temperature; t_i represents the internal temperature of the generator; and f indicates whether the fan blade is iced;
step 2) applying "range method" (min-max) normalization to the fan historical-data samples composed of the collected and sorted training vectors, so that the processed data are better suited to training the learning model; the normalization formula being:
X_new = (X − X_min) / (X_max − X_min)
where X represents a fan historical-data sample; X_min and X_max represent the minimum and maximum values in the fan historical-data samples; and X_new is the processed fan historical-data sample;
step 3) determining the number of BIRCH clusters according to the data distribution of the minority-class samples in the fan historical-data samples and the Calinski-Harabasz Index score under different cluster numbers, the minority-class samples being the data samples corresponding to fan icing in the fan historical-data samples; first assuming a set of candidate BIRCH cluster numbers, and then, after the Calinski-Harabasz Index scores are computed, selecting the cluster number z with the highest score;
step 4) applying the BIRCH clustering algorithm to the range-normalized data samples X_new to obtain z clusters, and storing the clustering result in the data set D = {cluster_1, cluster_2, cluster_3, cluster_4, cluster_5, ..., cluster_z};
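Steps 3)–4) can be sketched with scikit-learn's Birch and calinski_harabasz_score (an assumption on our part — the patent does not name a library; the candidate range and the stand-in data are illustrative):

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Stand-in for the normalized minority-class samples X_new.
X_min, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Score an assumed set of candidate cluster numbers and keep the best z.
best_z, best_score = None, -np.inf
for z in range(2, 8):
    labels = Birch(n_clusters=z).fit_predict(X_min)
    score = calinski_harabasz_score(X_min, labels)
    if score > best_score:
        best_z, best_score = z, score

# Re-cluster with the winning z and store the result as D = {cluster_1, ...}.
labels = Birch(n_clusters=best_z).fit_predict(X_min)
D = {f"cluster_{c + 1}": X_min[labels == c] for c in range(best_z)}
print(best_z, sorted(D))
```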
step 5) calculating the density value of the minority-class sample points in each cluster of the set D according to the density formula; the sample density value of a minority-class sample X_origin is defined as the sum of its distances to the K nearest same-class neighbour samples around it, described by the formula:
Density = Σ_{i=1}^{K} d_i
wherein d_i represents the Euclidean distance between the two sample points, and i indexes one of the K nearest same-class neighbour samples around X_origin;
step 6) sorting the minority-class sample points in all clusters in descending order of their density; according to the sorting result, dividing the sample points in each cluster equally into three concentration areas: high, medium and low;
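Steps 5)–6) can be sketched as follows (the helper names, K = 5 and the random toy cluster are our assumptions; note that a small distance sum corresponds to a dense, i.e. high-concentration, neighbourhood):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def density(cluster_samples, K=5):
    """Sum of Euclidean distances from each sample to its K nearest
    same-class neighbours (the patent's Density = sum_{i=1..K} d_i)."""
    nn = NearestNeighbors(n_neighbors=K + 1).fit(cluster_samples)
    dist, _ = nn.kneighbors(cluster_samples)   # column 0 is the zero self-distance
    return dist[:, 1:].sum(axis=1)

def split_by_concentration(cluster_samples, K=5):
    """Split one cluster's samples into high / medium / low concentration
    thirds; a smaller distance sum means a denser neighbourhood."""
    order = np.argsort(density(cluster_samples, K))
    thirds = np.array_split(order, 3)
    return {name: cluster_samples[idx]
            for name, idx in zip(("high", "medium", "low"), thirds)}

rng = np.random.default_rng(0)
areas = split_by_concentration(rng.normal(size=(30, 2)))
print({name: len(pts) for name, pts in areas.items()})
```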
step 7) in each concentration area of each cluster, finding the K nearest minority-class neighbours of each minority-class sample point; selecting high_num minority-class samples from the K nearest neighbours of each sample point in the high-concentration area, middle_num minority-class samples from the K nearest neighbours of each sample point in the medium-concentration area, and low_num minority-class samples from the K nearest neighbours of each sample point in the low-concentration area;
step 8) synthesizing new samples from the high_num, middle_num and low_num minority-class samples selected in step 7) according to the following formula; after synthesis, each minority-class sample in the high-concentration area generates high_num new sample points, and each minority-class sample in the medium- and low-concentration areas generates middle_num and low_num new sample points, respectively;
X_new_1,i = X_origin,i + rand(0,1) × (X_neighbor,i − X_origin,i), i = 1, 2, ..., n
wherein X_new_1 is the newly generated sample point; X_origin,i is the i-th dimension feature of a minority-class sample selected from one of the concentration areas in step 7); X_neighbor,i is the i-th dimension feature of a neighbouring sample point among the minority-class samples selected in step 7); i = 1, 2, ..., n, with n the total number of sample dimensions; rand(0,1) represents a random number between 0 and 1;
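The interpolation of steps 7)–8) can be sketched as below (a minimal version; num_to_make stands in for high_num / middle_num / low_num, and the per-dimension random gap follows the formula above, whereas classic SMOTE draws one scalar per synthetic point):

```python
import numpy as np

def synthesize(x_origin, neighbors, num_to_make, rng=None):
    """Generate num_to_make synthetic points from one minority-class sample:
    X_new[i] = X_origin[i] + rand(0,1) * (X_neighbor[i] - X_origin[i])."""
    if rng is None:
        rng = np.random.default_rng()
    new_points = []
    for _ in range(num_to_make):
        x_nb = neighbors[rng.integers(len(neighbors))]  # pick one of the K neighbours
        gap = rng.random(x_origin.shape)                # rand(0,1) for each dimension i
        new_points.append(x_origin + gap * (x_nb - x_origin))
    return np.array(new_points)

x = np.array([0.0, 0.0])
nbs = np.array([[1.0, 1.0], [2.0, 0.0]])
pts = synthesize(x, nbs, num_to_make=3, rng=np.random.default_rng(1))
print(pts.shape)
```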
step 9) interpolation over the minority-class samples may introduce noise sample points, so the synthesized data set must be denoised; whether a newly generated sample point is noise is judged from the classes of its neighbouring sample points; all newly generated minority-class sample points are scanned and the noise points deleted; the remaining synthetic sample points after denoising are then merged with the fan historical-data samples X_new processed in step 2), and the merged result is stored as the "final synthetic sample set"; the noise-point identification procedure is as follows:
a) finding the 20 nearest neighbour samples of a newly generated sample point X_new_1 among the fan historical-data samples; let m' be the number of those 20 nearest neighbours that belong to the majority class;
b) if m' = 20, the newly generated sample point X_new_1 is judged to be a noise point;
c) if 0 ≤ m' ≤ 10, the new sample point X_new_1 is judged to be a safe new sample point, not a noise point, and no operation is performed;
d) if 10 < m' < 20, the new sample point X_new_1 is judged to be a dangerous point and is added to the DANGER set;
finally, for each sample point X_new_1 in the DANGER set, a new sample is generated with the SMOTE algorithm and all noise points are removed; the SMOTE formula is expressed as:
X_new_2 = X_new_1 + rand(0,1) × (X_j − X_new_1), j = 1, 2, ..., N
in the formula, X_new_2 represents the newly synthesized minority-class sample; X_new_1 represents the original minority-class sample used to synthesize the new sample; rand(0,1) represents a random number between 0 and 1; X_j represents a sample randomly selected from the K neighbours of X_new_1 among the new minority-class sample points; N is the total number of new minority-class sample points;
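The noise / safe / DANGER rule of step 9) can be sketched as a Borderline-SMOTE-style check (the helper names and toy data are ours; the thresholds follow a)–d) above, with m' counting majority-class neighbours):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def label_synthetic(new_pts, hist_X, hist_is_minority, m=20):
    """For each synthetic point, count majority-class samples among its m
    nearest historical neighbours: m' == m -> noise, m' <= m/2 -> safe,
    otherwise -> danger (candidate for the DANGER set)."""
    nn = NearestNeighbors(n_neighbors=m).fit(hist_X)
    _, idx = nn.kneighbors(new_pts)
    m_prime = (~hist_is_minority[idx]).sum(axis=1)
    return np.where(m_prime == m, "noise",
                    np.where(m_prime <= m // 2, "safe", "danger"))

rng = np.random.default_rng(0)
hist_X = np.vstack([rng.normal(0, 1, (50, 2)),     # majority class near the origin
                    rng.normal(10, 1, (10, 2))])   # minority class near (10, 10)
is_min = np.array([False] * 50 + [True] * 10)
new_pts = np.array([[0.0, 0.0],     # deep inside the majority region
                    [10.0, 10.0]])  # inside the minority region
print(label_synthetic(new_pts, hist_X, is_min))
```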
step 10) training a random forest model on the data in the "final synthetic sample set" to obtain the fan-icing prediction model; the operating data of fan No. 2 are used as test data to verify the effectiveness of the prediction method, with AUC as the evaluation criterion for the fan-blade-icing prediction; AUC is a quantitative measure of classifier quality with values between 0.5 and 1: a higher AUC indicates a better classifier, and an AUC of 0.5 is equivalent to completely random classification.
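Step 10) can be sketched with scikit-learn's RandomForestClassifier and AUC scoring (the synthetic classification data and train/test split stand in for the "final synthetic sample set" and the No. 2 fan's data, which are not available here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in training data: 6 features (like [vw, vg, p, te, ti] plus noise),
# imbalanced binary labels as in the icing problem.
X, y = make_classification(n_samples=1000, n_features=6,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])  # 0.5 = random, 1.0 = perfect
print(round(auc, 3))
```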
2. The method of claim 1, wherein the Calinski-Harabasz Index score in step 3) is calculated as:
s(k) = tr(B_k) / tr(W_k) × (m − k) / (k − 1)
wherein B_k represents the between-cluster covariance matrix; W_k represents the within-cluster covariance matrix; tr(·) denotes the trace of a matrix; m represents the number of minority-class samples in the fan historical-data samples; k represents the number of clusters; the cluster number z selected in step 3) is the k with the highest score s(k).
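The score formula of claim 2 can be checked numerically against scikit-learn's implementation (the toy two-blob data and the KMeans clustering are illustrative only):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(6, 1, (40, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

m, k = len(X), 2
mu = X.mean(axis=0)
# tr(B_k): between-cluster scatter, each cluster mean weighted by its size.
tr_B = sum(len(X[labels == c]) * np.sum((X[labels == c].mean(axis=0) - mu) ** 2)
           for c in range(k))
# tr(W_k): within-cluster scatter around each cluster mean.
tr_W = sum(np.sum((X[labels == c] - X[labels == c].mean(axis=0)) ** 2)
           for c in range(k))
s = tr_B / tr_W * (m - k) / (k - 1)   # the claim-2 formula
print(np.isclose(s, calinski_harabasz_score(X, labels)))
```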
3. The method of claim 1, wherein the neighbour-selection criterion in step 7) is measured with the Euclidean distance formula, i.e. the K minority-class samples with the smallest Euclidean distance around a minority-class sample point are selected; the Euclidean distance between samples x_j and x_i is:

d(x_j, x_i) = sqrt( Σ_{t=1}^{n} (x_j^(t) − x_i^(t))² )

wherein x_j^(t) and x_i^(t) respectively represent the values of samples x_j and x_i in the t-th dimension; a sample x_j is expressed as x_j = (x_j^(1), x_j^(2), ..., x_j^(n)), i.e. x_j has n dimensions.
CN201910207037.4A 2019-03-19 2019-03-19 Fan blade icing prediction method based on unbalanced data set Active CN109978039B (en)

Publications (2)

Publication Number Publication Date
CN109978039A CN109978039A (en) 2019-07-05
CN109978039B true CN109978039B (en) 2020-10-16
