CN109978039B - Fan blade icing prediction method based on unbalanced data set - Google Patents
- Publication number: CN109978039B (application CN201910207037.4A)
- Authority
- CN
- China
- Prior art keywords: sample, samples, new, minority, fan
- Legal status: Active (an assumption, not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING; G06F18/00—Pattern recognition; G06F18/20—Analysing
- G06F18/2148—Generating training patterns; bootstrap methods, e.g. bagging or boosting, characterised by the process organisation or structure, e.g. boosting cascade
- G06F18/231—Clustering techniques; hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
- G06F18/24147—Classification based on distances to closest patterns, e.g. nearest neighbour classification
- G06F18/24323—Tree-organised classifiers
Abstract
The invention discloses a method for predicting fan blade icing under unbalanced data set conditions, which balances the distribution of data samples in an unbalanced data set and predicts fan blade icing events in combination with a random forest (RF) algorithm. The algorithm first applies BIRCH hierarchical clustering to the original minority-class samples and divides each cluster into concentration zones according to the density of the sample points. More samples are synthesized in the lower-concentration zones, while fewer are required in the higher-concentration zones. To preserve the original distribution of the minority class, new samples are synthesized separately in the different concentration zones within each cluster. Second, the BIRCH-SMOTE algorithm improves the linear interpolation operation by adding randomness to the interpolation process, effectively avoiding overlap and redundancy among synthesized samples. Finally, a random forest model is trained on the balanced data set to obtain the fan blade icing prediction result.
Description
Technical Field
The invention relates to the technical field of short-term wind power generation, in particular to a method for predicting blade icing of a wind driven generator based on an unbalanced data set.
Background
Wind energy is a typical renewable clean energy source and has received much attention worldwide due to its abundance and large-scale exploitation conditions. Wind power generation holds an overwhelming share of the world's installed renewable-energy generating capacity. Among renewable sources, wind accounts for more than half of the available renewable energy, and wind power generation is the most mature technology for exploiting renewable resources. In recent years, world wind power generation has grown rapidly, and the prospects are bright. By December 2012, the world's installed wind power capacity had increased from 60 GW in 2000 to 282.578 GW. At present, China is the country with the world's largest installed wind power capacity and the fastest growth. By 2015, China's total installed wind power capacity had reached 145.1 GW, accounting for about 2.5% of the country's total installed capacity, with an annual growth rate of 26.6%.
While developing rapidly, wind energy also faces outstanding problems. As wind power generators grow ever taller, their blades freeze easily in severely cold environments. Blade icing is a global problem in the field of wind power generation. Blade icing, changes in material and structural performance, and load changes caused by low-temperature environments pose great threats to the power generation performance and safe operation of the fan. As the design power of fans keeps rising, so does their height, so that in winter many fans reach the lower cloud layer and freeze very easily in low-temperature, humid conditions. At present, real-time fan operation data are mainly stored in an SCADA system, and blade icing faults are chiefly monitored by comparing the deviation between the fan's actual and theoretical power, triggering an alarm and shutdown when the deviation reaches a set value. By the time the alarm is triggered, however, the blade has often already iced over a large area, and continued operation increases the risk of blade fracture and damage. Although many new fans are equipped with automatic de-icing systems, the practical challenge is accurately predicting the early stage of icing so that the de-icing system can be switched on as early as possible. Therefore, the prediction accuracy of the icing process determines whether the fan can operate normally and safely.
Disclosure of Invention
The purpose of the invention is as follows: the invention provides a wind driven generator blade icing prediction method based on an unbalanced data set, and aims to solve the problem of insufficient prediction accuracy of fan blade icing under unbalanced data conditions.
The technical scheme is as follows: the invention provides an improved SMOTE algorithm (BIRCH-SMOTE), which balances the distribution of data samples in an unbalanced data set and predicts fan blade icing events in combination with a random forest (RF) algorithm. The algorithm first applies BIRCH hierarchical clustering to the original minority-class samples and divides each cluster into concentration zones according to the density of the sample points. More samples are synthesized in the lower-concentration zones, while fewer are required in the higher-concentration zones. To preserve the original distribution of the minority class, new samples are synthesized separately in the different concentration zones within each cluster. Second, the BIRCH-SMOTE algorithm improves the linear interpolation operation by adding randomness to the interpolation process, effectively avoiding overlap and redundancy among synthesized samples. Finally, a random forest model is trained on the balanced data set to obtain the fan blade icing prediction result.
Specifically, the method for predicting the icing of the fan under the unbalanced data set comprises the following steps:
step 1) collecting and sorting historical meteorological data and fan running state data of a wind power plant, and finally storing the sorted data in a database for convenient use in prediction; the historical meteorological data of the wind power plant, the fan running state data and the prediction target form a fan historical data training vector; the fan historical data training vector specifically comprises the following dimensions: wind speed, generator rotational speed, wind power, ambient temperature, generator internal temperature, and whether the fan blade shows the icing phenomenon, which can be expressed as:

X = [v_w, v_g, p, t_e, t_i, f]

wherein v_w represents the wind speed; v_g represents the generator rotational speed; p represents the wind power; t_e represents the ambient temperature; t_i represents the generator internal temperature; f represents whether the fan blade is frozen;
step 2), fan historical data samples composed of the collected and sorted fan historical data training vectors are subjected to "range method" (min-max) normalization, so that the processed data are more suitable for training the learning model; the normalization formula is:

X_new = (X − X_min) / (X_max − X_min)

in the formula, X represents a fan historical data sample; X_min represents the minimum value in the fan historical data samples; X_max represents the maximum value in the fan historical data samples; X_new is the processed fan historical data sample;
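As a minimal sketch of the range-method normalization in step 2) (assuming NumPy; the function name is illustrative):

```python
import numpy as np

def range_normalize(X):
    """Min-max ("range method") normalization, applied per feature column."""
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    # Guard against constant columns to avoid division by zero.
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    return (X - x_min) / span
```

Each column is mapped into [0, 1] independently, matching the X_min/X_max definition above.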
step 3) determining the cluster number according to the data distribution of the minority-class samples in the fan historical data samples and the Calinski-Harabasz Index score under different cluster numbers; the minority-class samples are the data samples corresponding to fan icing in the fan historical data samples; firstly, a set of candidate BIRCH cluster numbers is assumed, and then, after the Calinski-Harabasz Index scores are computed, the cluster number z with the highest score is selected; the Calinski-Harabasz Index is calculated as:

s(k) = [tr(B_k) / tr(W_k)] × [(m − k) / (k − 1)]

wherein B_k represents the between-class covariance matrix; W_k represents the covariance matrix inside a cluster; tr(·) denotes the trace of a matrix; m represents the number of minority-class samples in the fan historical data samples; z represents the cluster number with the highest score; k represents the number of clusters;
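The cluster-count selection of step 3) can be sketched with scikit-learn (an assumed library choice; `calinski_harabasz_score` implements the scoring coefficient, and the candidate list mirrors the embodiment's 3 to 8 clusters):

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.metrics import calinski_harabasz_score

def best_cluster_count(X_min, candidates=(3, 4, 5, 6, 7, 8)):
    """Return the BIRCH cluster count with the highest Calinski-Harabasz score."""
    best_z, best_score = None, -np.inf
    for k in candidates:
        labels = Birch(n_clusters=k).fit_predict(X_min)
        score = calinski_harabasz_score(X_min, labels)
        if score > best_score:
            best_z, best_score = k, score
    return best_z
```

For the patent's data the winning count is reported as z = 6; on other data the score simply picks whichever candidate separates the minority samples best.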
step 4) after the fan historical data sample X_new normalized by the range method is clustered into z clusters with the BIRCH clustering algorithm, the clustering results are stored in the data set D = {cluster_1, cluster_2, cluster_3, cluster_4, cluster_5, ..., cluster_z};
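The construction of the set D in step 4) can be sketched with scikit-learn's `Birch` (an assumed implementation; the dictionary keys mirror the cluster_i naming):

```python
from sklearn.cluster import Birch

def birch_clusters(X_min, z):
    """Cluster the minority samples into z BIRCH clusters, returning the set D."""
    labels = Birch(n_clusters=z).fit_predict(X_min)
    return {f"cluster_{c + 1}": X_min[labels == c] for c in range(z)}
```

Each value in the returned dictionary is the array of minority samples assigned to that cluster.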
step 5) calculating the density value of the minority-class sample points in each cluster of the set D according to the density formula, which is defined as follows: the sample density value of a certain minority-class sample X_origin is the sum of the distances from X_origin to its K nearest same-class neighbours, and Density can be described as:

Density(X_origin) = Σ_{i=1}^{K} d_i

wherein d_i represents the Euclidean distance between the two sample points, and i indexes one of the K nearest same-class neighbours around sample X_origin;
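A sketch of the density computation in step 5), assuming scikit-learn's `NearestNeighbors` (one extra neighbour is requested because each point matches itself at distance zero):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def density_values(cluster, k=15):
    """Density of each minority point: the sum of Euclidean distances to its
    K nearest same-class neighbours (a larger sum means a sparser region)."""
    k = min(k, len(cluster) - 1)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(cluster)
    dist, _ = nn.kneighbors(cluster)   # column 0 is the point itself
    return dist[:, 1:].sum(axis=1)
```

The default k=15 follows the embodiment's experimentally chosen K value.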
step 6), the minority-class sample points in all clusters are arranged in descending order of their density values; according to the sorting result, the sample points in each cluster are divided equally into three concentration zones: high, medium and low concentration;
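Step 6) can be sketched as below; note that the document's density value is a sum of distances, so a larger value means a sparser neighbourhood, and this sketch assumes the sparsest third forms the low-concentration zone:

```python
import numpy as np

def split_concentration_zones(cluster, densities):
    """Sort points by density value (sum of distances, descending) and split
    the cluster into three equal concentration zones."""
    order = np.argsort(-densities)        # largest sum first, i.e. sparsest first
    thirds = np.array_split(order, 3)
    # Sparsest third = low concentration, densest third = high concentration.
    return {"low": cluster[thirds[0]],
            "medium": cluster[thirds[1]],
            "high": cluster[thirds[2]]}
```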
step 7) in each concentration zone of each cluster, the K nearest minority-class neighbours are found for each minority-class sample point; the nearest neighbours are selected by the Euclidean distance formula, i.e. the K samples with the smallest Euclidean distance to the sample point are chosen; the Euclidean distance between sample x_j and sample x_i is:

d(x_j, x_i) = sqrt( Σ_{t=1}^{n} (x_j^t − x_i^t)^2 )

where x_j^t and x_i^t respectively represent the values of samples x_j and x_i in the t-th dimension; a sample x_j can be expressed as x_j = (x_j^1, x_j^2, ..., x_j^n), i.e. x_j has n dimensions in total; for the high-concentration zone, high_num minority samples are selected from the K nearest minority neighbours found for each sample point; for the medium-concentration zone, middle_num minority samples are selected from the K nearest minority neighbours; for the low-concentration zone, low_num minority samples are selected from the K nearest minority neighbours;
step 8) synthesizing new samples according to the following formula, using the high_num, middle_num and low_num minority samples selected in step 7); after synthesis, each minority sample in the high-concentration zone generates high_num new sample points, and each minority sample in the medium- and low-concentration zones obtains middle_num and low_num new sample points respectively;

X_new_1,i = X_origin,i + rand(0,1) × (X_neighbor,i − X_origin,i), i = 1, 2, ..., n

wherein X_new_1 is the newly generated sample point; X_origin,i is the i-th dimension feature of a minority sample selected from a concentration zone in step 7); X_neighbor,i represents the i-th dimension feature of one of the neighbouring samples of X_origin selected in step 7); n is the total number of dimensions of the sample; rand(0,1) represents a random number between 0 and 1, drawn anew for each dimension;
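The per-dimension interpolation of step 8) can be sketched as below (assuming scikit-learn for the neighbour search; `num` stands for high_num, middle_num or low_num depending on the zone). Drawing a fresh rand(0,1) per dimension is the added randomness that keeps synthetic points off the straight line between seed and neighbour:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def synthesize(zone, num, k=15, rng=None):
    """Generate `num` new points per zone sample by per-dimension interpolation
    towards randomly chosen nearest neighbours."""
    rng = np.random.default_rng(rng)
    k = min(k, len(zone) - 1)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(zone)
    _, idx = nn.kneighbors(zone)                       # column 0 is the point itself
    new_points = []
    for j, x in enumerate(zone):
        for neighbor in zone[rng.choice(idx[j][1:], size=num)]:
            # A separate rand(0,1) for every dimension, as the formula states.
            gap = rng.random(x.shape[0])
            new_points.append(x + gap * (neighbor - x))
    return np.array(new_points)
```

Every synthetic coordinate lies between the seed's and the neighbour's value in that dimension.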
step 9) noise sample points may be introduced by the interpolation operation on the minority samples, so the synthesized data set needs to be denoised; whether a newly generated sample point is noise is judged by identifying the class of its neighbouring sample points; all newly generated minority sample points are scanned and the noise points are deleted; the specific noise-point identification process is as follows:
a) calculate the 20 nearest neighbours of a newly generated sample point X_new_1 in the fan historical data samples; let m' be the number of those 20 nearest neighbours that belong to the minority class;
b) if m' = 20, the newly generated sample point X_new_1 is judged to be a noise point;
c) if 0 ≤ m' ≤ 10, the new sample point X_new_1 is judged to be a safe new sample point, not a noise point, and no operation is performed;
d) if 10 < m' < 20, the new sample point X_new_1 is judged to be a danger point; some new minority sample points need to be generated near this point and added to the DANGER set;
finally, for each sample point X_new_1 in the DANGER set, a new sample is generated with the SMOTE algorithm, and all noise points are removed; the SMOTE formula can be expressed as:

X_new_2 = X_new_1 + rand(0,1) × (X_j − X_new_1), j = 1, 2, ..., N

in the formula, X_new_2 represents the newly synthesized minority-class sample; X_new_1 represents the original minority-class sample used to synthesize the new sample; rand(0,1) represents a random number between 0 and 1; X_j represents a randomly selected one of the K neighbouring samples of X_new_1; N is the total number of new minority-class sample points.
The new synthesized sample points remaining after denoising are merged with X_new, and the merged result is stored in the "final synthesized sample set";
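The noise/safe/danger screening of step 9) can be sketched as follows, taking the thresholds (m' = 20 noise, m' ≤ 10 safe, otherwise danger) verbatim from the text; `y_all` is assumed to mark minority-class samples with 1:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def classify_synthetic(new_points, X_all, y_all, n_neighbors=20):
    """Label each synthetic point 'noise', 'safe' or 'danger' from the class
    make-up of its 20 nearest neighbours in the full data set."""
    nn = NearestNeighbors(n_neighbors=n_neighbors).fit(X_all)
    _, idx = nn.kneighbors(new_points)
    m_prime = y_all[idx].sum(axis=1)          # count of minority-class neighbours
    return np.where(m_prime == n_neighbors, "noise",
           np.where(m_prime <= n_neighbors // 2, "safe", "danger"))
```

Points labelled "noise" would be deleted, and "danger" points would seed further SMOTE interpolation per the text.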
step 10), the data in the "final synthesized sample set" are used to train a random forest model, yielding the fan icing prediction model; the operation data of fan No. 2 are used as test data to verify the effectiveness of the unbalanced-data-set-based wind turbine blade icing prediction method; AUC is used as the evaluation criterion for the fan blade icing prediction effect; AUC is a quantitative measure of classifier quality, generally between 0.5 and 1: a higher AUC indicates better classifier performance, while an AUC of 0.5 is equivalent to completely random classification.
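Step 10) can be sketched with scikit-learn's random forest and AUC scoring (the 200-tree setting is an illustrative assumption, not from the patent):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def train_and_score(X_balanced, y_balanced, X_test, y_test, seed=0):
    """Train a random forest on the balanced set and report test AUC
    (0.5 = random guessing; values nearer 1 = better icing prediction)."""
    rf = RandomForestClassifier(n_estimators=200, random_state=seed)
    rf.fit(X_balanced, y_balanced)
    scores = rf.predict_proba(X_test)[:, 1]   # probability of the icing class
    return roc_auc_score(y_test, scores)
```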
Has the advantages that: the method clusters the original minority-class data samples with the BIRCH clustering algorithm and divides each cluster into concentration zones according to the density of the sample points. More samples are synthesized in the lower-concentration zones, while fewer are required in the higher-concentration zones. To preserve the original distribution of the minority class, new samples are synthesized separately in the different concentration zones within each cluster. Second, the BIRCH-SMOTE algorithm improves the linear interpolation operation by adding randomness to the interpolation process, effectively avoiding overlap and redundancy among synthesized samples. After the new minority samples are synthesized, to avoid noise sample points introduced by the synthesis operation, the invention denoises the newly synthesized sample points. Finally, a random forest model is trained on the balanced data set to obtain the fan blade icing prediction result. Compared with the standard SMOTE algorithm and improvements in other literature, BIRCH-SMOTE significantly improves the accuracy of fan blade icing prediction on unbalanced data sets.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a graph comparing the operation effects of the embodiment of the present invention (BIRCH-SMOTE algorithm).
Detailed Description
As shown in FIG. 1, the method for predicting the icing condition of the blade of the wind driven generator based on the unbalanced data set comprises the following steps:
step 1) collecting and sorting historical meteorological data and fan running state data of a wind power plant, and finally storing the sorted data in a database for convenient use in prediction; the historical meteorological data of the wind power plant, the fan running state data and the prediction target (whether the fan is frozen) form a fan historical data training vector; the fan historical data training vector specifically comprises the following dimensions: wind speed, generator rotational speed, wind power, ambient temperature, generator internal temperature, and whether the fan blade shows the icing phenomenon, which can be expressed as:

X = [v_w, v_g, p, t_e, t_i, f]

wherein v_w represents the wind speed; v_g represents the generator rotational speed; p represents the wind power; t_e represents the ambient temperature; t_i represents the generator internal temperature; f represents whether the fan blade is frozen; in this embodiment, the real-time operation data of 2 wind power generators in a wind farm from August to December are collected and collated, and the specific distribution of the data is as follows:
step 2), fan historical data samples composed of the collected and sorted fan historical data training vectors are subjected to "range method" (min-max) normalization, so that the processed data are more suitable for training the learning model; the normalization formula is:

X_new = (X − X_min) / (X_max − X_min)

in the formula, X represents a fan historical data sample; X_min represents the minimum value in the fan historical data samples; X_max represents the maximum value in the fan historical data samples; X_new is the processed fan historical data sample;
step 3) determining the cluster number according to the data distribution of the minority-class samples in the fan historical data samples and the Calinski-Harabasz Index score under different cluster numbers; the minority-class samples are the data samples corresponding to fan icing in the fan historical data samples; firstly, candidate BIRCH cluster numbers are assumed, namely: 3, 4, 5, 6, 7 and 8 clusters; then, after the Calinski-Harabasz Index scores are computed, the cluster number z with the highest score is selected; the Calinski-Harabasz Index is calculated as:

s(k) = [tr(B_k) / tr(W_k)] × [(m − k) / (k − 1)]

wherein B_k represents the between-class covariance matrix; W_k represents the covariance matrix inside a cluster; tr(·) denotes the trace of a matrix; m represents the number of minority-class samples in the fan historical data samples; z represents the cluster number with the highest score; k represents the number of clusters;

Finally, after the Calinski-Harabasz Index scores are computed, 6 clusters achieve the highest score, i.e. z = 6.
Step 4) after the minority samples in the data sample X_new normalized by the range method are clustered into 6 clusters with the BIRCH clustering algorithm, the clustering results are stored in the data set D = {cluster_1, cluster_2, cluster_3, cluster_4, cluster_5, cluster_6};
step 5) calculating the density value of the minority-class sample points in each cluster of the set D according to the density formula, which is defined as follows: the sample density value of a certain minority-class sample X_origin is the sum of the distances from X_origin to its K nearest same-class neighbours, and Density can be described as:

Density(X_origin) = Σ_{i=1}^{K} d_i

wherein d_i represents the Euclidean distance between the two sample points, and i indexes one of the K nearest same-class neighbours around sample X_origin; through experimental exploration, the K value suitable for the fan data in this embodiment is 15, i.e. the 15 nearest samples around each minority-class sample are selected as the basis for calculating the density value.
Step 6), the minority-class sample points in all clusters are arranged in descending order of their density values; according to the sorting result, the sample points in each cluster are divided equally into three concentration zones: high, medium and low concentration;
step 7) in each concentration zone of each cluster, the 15 nearest minority-class neighbours are found for each minority-class sample point; the nearest neighbours are selected by the Euclidean distance formula, i.e. the 15 samples with the smallest Euclidean distance to the sample point are chosen; the Euclidean distance between sample x_j and sample x_i is:

d(x_j, x_i) = sqrt( Σ_{t=1}^{n} (x_j^t − x_i^t)^2 )

where x_j^t and x_i^t respectively represent the values of samples x_j and x_i in the t-th dimension; a sample x_j can be expressed as x_j = (x_j^1, x_j^2, ..., x_j^n), i.e. x_j has n dimensions in total; for the high-concentration zone, high_num = 7 minority samples are selected from the 15 nearest minority neighbours found for each sample point; for the medium-concentration zone, middle_num = 11 minority samples are selected from the 15 nearest minority neighbours; for the low-concentration zone, low_num = 13 minority samples are selected from the 15 nearest minority neighbours;
step 8) using the 7, 11 and 13 minority samples selected in step 7), new samples are synthesized according to the following formula; after synthesis, each minority sample in the high-concentration zone generates high_num = 7 new sample points, and each minority sample in the medium- and low-concentration zones obtains middle_num = 11 and low_num = 13 new sample points respectively;

X_new_1,i = X_origin,i + rand(0,1) × (X_neighbor,i − X_origin,i), i = 1, 2, ..., n

wherein X_new_1 is the newly generated sample point; X_origin,i is the i-th dimension feature of a minority sample selected from a concentration zone in step 7); X_neighbor,i represents the i-th dimension feature of one of the neighbouring samples of X_origin selected in step 7); n is the total number of dimensions of the sample; rand(0,1) represents a random number between 0 and 1, drawn anew for each dimension;
step 9) noise sample points may be introduced by the interpolation operation on the minority samples, so the synthesized data set needs to be denoised; whether a newly generated sample point is noise is judged by identifying the class of its neighbouring sample points; all newly generated minority sample points are scanned and the noise points are deleted; the specific noise-point identification process is as follows:
a) calculate the 20 nearest neighbours of a newly generated sample point X_new_1 in the fan historical data samples; let m' be the number of those 20 nearest neighbours that belong to the minority class;
b) if m' = 20, the newly generated sample point X_new_1 is judged to be a noise point;
c) if 0 ≤ m' ≤ 10, the new sample point X_new_1 is judged to be a safe new sample point, not a noise point, and no operation is performed;
d) if 10 < m' < 20, the new sample point X_new_1 is judged to be a danger point; some new minority sample points are generated near this point and added to the DANGER set;
finally, for each sample point X_new_1 in the DANGER set, a new sample is generated with the SMOTE algorithm, and all noise points are removed; the SMOTE formula can be expressed as:

X_new_2 = X_new_1 + rand(0,1) × (X_j − X_new_1), j = 1, 2, ..., N

in the formula, X_new_2 represents the newly synthesized minority-class sample; X_new_1 represents the original minority-class sample used to synthesize the new sample; rand(0,1) represents a random number between 0 and 1; X_j represents a randomly selected one of the K neighbouring samples of X_new_1; N is the total number of new minority-class sample points.
The new synthesized sample points remaining after denoising are merged with X_new, and the merged result is stored in the "final synthesized sample set";
step 10), the data in the "final synthesized sample set" are used to train a random forest model, yielding the fan icing prediction model; the operation data of fan No. 2 are used as test data to verify the effectiveness of the unbalanced-data-set-based wind turbine blade icing prediction method; in this embodiment, AUC is used as the evaluation criterion for the fan blade icing prediction effect; AUC is a quantitative measure of classifier quality, generally between 0.5 and 1: a higher AUC indicates better classifier performance, while an AUC of 0.5 is equivalent to completely random classification.
Claims (3)
1. A method for predicting icing of a fan blade based on an unbalanced data set is characterized by comprising the following steps: the method comprises the following steps:
step 1) collecting and sorting historical meteorological data and fan running state data of a wind power plant, and finally storing the sorted data in a database for convenient use in prediction; the historical meteorological data of the wind power plant, the fan running state data and the prediction target form a fan historical data training vector, which specifically comprises the following dimensions: wind speed, generator rotational speed, wind power, ambient temperature, generator internal temperature, and whether the fan blade shows the icing phenomenon, expressed as:

X = [v_w, v_g, p, t_e, t_i, f]

wherein v_w represents the wind speed; v_g represents the generator rotational speed; p represents the wind power; t_e represents the ambient temperature; t_i represents the generator internal temperature; f represents whether the fan blade is frozen;
step 2), fan historical data samples composed of the collected and sorted fan historical data training vectors are subjected to "range method" (min-max) normalization, so that the processed data are more suitable for training the learning model; the normalization formula is:

X_new = (X − X_min) / (X_max − X_min)

in the formula, X represents a fan historical data sample; X_min represents the minimum value in the fan historical data samples; X_max represents the maximum value in the fan historical data samples; X_new is the processed fan historical data sample;
step 3) determining the cluster number according to the data distribution of the minority-class samples in the fan historical data samples and the Calinski-Harabasz Index score under different cluster numbers, wherein the minority-class samples are the data samples corresponding to fan icing in the fan historical data samples; firstly, candidate BIRCH cluster numbers are assumed, and then the cluster number z with the highest Calinski-Harabasz Index score is selected;
step 4) after the data sample X_new normalized by the range method is clustered into z clusters with the BIRCH clustering algorithm, the clustering results are stored in the data set D = {cluster_1, cluster_2, cluster_3, cluster_4, cluster_5, ..., cluster_z};
step 5) calculating the density value of the minority-class sample points in each cluster of the set D according to the density formula, which is defined as follows: the sample density value of a certain minority-class sample X_origin is the sum of the distances from X_origin to its K nearest same-class neighbours, and Density is described by the formula:

Density(X_origin) = Σ_{i=1}^{K} d_i

wherein d_i represents the Euclidean distance between the two sample points, and i indexes one of the K nearest same-class neighbours around sample X_origin;
step 6), the minority-class sample points in all clusters are arranged in descending order of their density values; according to the sorting result, the sample points in each cluster are divided equally into three concentration zones: high, medium and low concentration;
step 7) in each concentration zone of each cluster, the K nearest minority-class neighbours are found for each minority-class sample point; for the high-concentration zone, high_num minority samples are selected from the K nearest minority neighbours found for each sample point; for the medium-concentration zone, middle_num minority samples are selected from the K nearest minority neighbours; for the low-concentration zone, low_num minority samples are selected from the K nearest minority neighbours;
step 8) synthesizing new samples from the high_num, middle_num and low_num minority class samples selected in step 7) according to the following formula; after synthesis, each minority class sample in the high concentration area generates high_num new sample points, and each minority class sample in the medium and low concentration areas generates middle_num and low_num new sample points respectively:

X_new_1,i = X_origin,i + rand(0,1) × (X_neighbor,i − X_origin,i), i = 1, 2, ..., n
wherein X_new_1 is the newly generated sample point; X_origin,i is the i-th dimension feature of a minority class sample selected from one of the concentration areas in step 7); X_neighbor,i is the i-th dimension feature of one of the neighboring minority class samples of X_origin selected in step 7); n is the total number of feature dimensions of a sample; rand(0,1) represents a random number between 0 and 1;
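The interpolation of steps 7)–8) can be sketched as follows. Here `num` stands in for high_num / middle_num / low_num, and the per-dimension random factor follows the formula above; the function name is an assumption:

```python
# Sketch of step 8): generate `num` new points between a minority sample
# X_origin and its selected nearest neighbours, dimension by dimension.
import numpy as np

def synthesize(x_origin, neighbors, num, rng=None):
    """Interpolate new samples between x_origin and its first `num` neighbours."""
    rng = np.random.default_rng(rng)
    new_samples = []
    for x_neighbor in neighbors[:num]:
        r = rng.random(x_origin.shape)        # rand(0,1) per dimension
        new_samples.append(x_origin + r * (x_neighbor - x_origin))
    return np.array(new_samples)
```

Each synthetic point lies on the axis-aligned segment between the origin sample and one neighbour, so it stays inside the local minority region.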
step 9) since interpolation on the minority class samples may introduce noise sample points, the synthesized data set needs to be denoised; whether a newly generated sample point is noise is judged from the class attributes of its neighboring sample points; all newly generated minority class sample points are scanned and the noise points are deleted; the remaining newly synthesized sample points after denoising are merged with the fan historical data sample X_new processed in step 2), and the merged result is stored as the final synthesized sample set; the specific flow of noise point identification is as follows:
a) finding the 20 nearest neighbor samples of a newly generated sample point X_new_1 among the fan historical data samples; let m' be the number of majority class samples among these 20 nearest neighbors;
b) if m' = 20, the newly generated sample point X_new_1 is judged to be a noise point;
c) if 0 ≤ m' ≤ 10, the new sample point X_new_1 is judged to be a safe new sample point, not a noise point, and no operation is performed;
d) if 10 < m' < 20, the new sample point X_new_1 is judged to be a dangerous point, and it is added to the DANGER set so that new minority class sample points can be generated near it;
finally, for each newly generated sample point X_new_1 in the DANGER set, a new sample is generated by the SMOTE algorithm, and all noise points are removed; the formula of the SMOTE algorithm is expressed as follows:
X_new_2 = X_new_1 + rand(0,1) × (X_j − X_new_1), j = 1, 2, ..., N
in the formula, X_new_2 represents the newly synthesized minority class sample; X_new_1 represents the original minority class sample used to synthesize the new sample; rand(0,1) represents a random number between 0 and 1; X_j represents a sample randomly selected from the K neighbors of X_new_1 among the new minority class sample points, and N is the total number of the new minority class sample points;
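The noise/safe/danger rule of step 9) can be sketched as below, assuming (as in the rewritten claim) that m' counts majority-class neighbors, with 1 labelling the majority class; the function name and label encoding are assumptions:

```python
# Sketch of the noise filter of step 9): classify a synthetic minority
# point by the class make-up of its 20 nearest historical neighbours.
import numpy as np

def classify_new_point(x_new, X_hist, y_hist, k=20):
    """Return 'noise', 'safe' or 'danger'. y_hist: 1 = majority, 0 = minority."""
    dist = np.sqrt(((X_hist - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dist)[:k]
    m = int(y_hist[nearest].sum())   # m': majority-class neighbours
    if m == k:
        return "noise"               # surrounded only by the majority class
    if m <= k // 2:
        return "safe"                # mostly minority neighbours
    return "danger"                  # borderline: goes into the DANGER set
```

This mirrors the Borderline-SMOTE convention: noise points are deleted, safe points are kept as-is, and danger points seed another round of SMOTE interpolation.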
step 10) training the data in the final synthesized sample set with a random forest model to obtain the fan blade icing prediction model; the operating data of the No. 2 fan is used as test data to check the effectiveness of the prediction method; AUC is adopted as the evaluation criterion for the fan blade icing prediction effect; AUC is a quantitative measure of classifier quality with a value between 0.5 and 1, where a higher AUC indicates better classifier performance and an AUC of 0.5 is equivalent to completely random classification.
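Step 10) can be sketched end-to-end with scikit-learn. The data below is synthetic and illustrative; in the method it would be the final synthesized sample set and the No. 2 fan's operating data:

```python
# Sketch of step 10): train a random forest on the balanced training set
# and evaluate a held-out set with AUC. All data here is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (200, 4)), rng.normal(3, 1, (200, 4))])
y_train = np.array([0] * 200 + [1] * 200)      # 1 = icing (minority in practice)
X_test = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(3, 1, (50, 4))])
y_test = np.array([0] * 50 + [1] * 50)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
```

AUC is computed from the predicted icing probabilities rather than hard labels, which matches its use here as a threshold-free measure on an imbalanced test set.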
2. The method of claim 1, wherein the Calinski-Harabasz Index score in step 3) is calculated as:

s(k) = [tr(B_k) / tr(W_k)] × [(m − k) / (k − 1)]

wherein B_k represents the between-class covariance matrix; W_k represents the covariance matrix inside a cluster; tr(·) denotes the trace of a matrix; m represents the number of minority class samples in the fan historical data samples; k represents the number of clusters; z is the cluster number k achieving the highest score.
3. The method of claim 1, wherein the neighbor selection criterion in step 7) is measured by the Euclidean distance, i.e. the K minority class samples with the smallest Euclidean distance around a minority class sample point are selected; the Euclidean distance between sample x_j and sample x_i is:

d(x_i, x_j) = √( Σ_{l=1}^{n} (x_i,l − x_j,l)² )
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910207037.4A CN109978039B (en) | 2019-03-19 | 2019-03-19 | Fan blade icing prediction method based on unbalanced data set |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109978039A CN109978039A (en) | 2019-07-05 |
CN109978039B true CN109978039B (en) | 2020-10-16 |
Family
ID=67079465
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910207037.4A Active CN109978039B (en) | 2019-03-19 | 2019-03-19 | Fan blade icing prediction method based on unbalanced data set |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109978039B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110985315A (en) * | 2019-12-16 | 2020-04-10 | 南京松数科技有限公司 | Early prediction method for detecting icing of fan blade |
CN111242206B (en) * | 2020-01-08 | 2022-06-17 | 吉林大学 | High-resolution ocean water temperature calculation method based on hierarchical clustering and random forests |
CN111310785A (en) * | 2020-01-15 | 2020-06-19 | 杭州华网信息技术有限公司 | National power grid mechanical external damage prediction method |
CN112465245A (en) * | 2020-12-04 | 2021-03-09 | 复旦大学青岛研究院 | Product quality prediction method for unbalanced data set |
CN114330881A (en) * | 2021-12-29 | 2022-04-12 | 南京邮电大学 | Data-driven fan blade icing prediction method and device |
CN117892213B (en) * | 2024-03-18 | 2024-06-25 | 中国水利水电第十四工程局有限公司 | Diagnosis method for icing detection and early warning of wind driven generator blade |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080059392A1 (en) * | 1998-05-01 | 2008-03-06 | Stephen Barnhill | System for providing data analysis services using a support vector machine for processing data received from a remote source |
CN109086793A (en) * | 2018-06-27 | 2018-12-25 | 东北大学 | A kind of abnormality recognition method of wind-driven generator |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080059392A1 (en) * | 1998-05-01 | 2008-03-06 | Stephen Barnhill | System for providing data analysis services using a support vector machine for processing data received from a remote source |
CN109086793A (en) * | 2018-06-27 | 2018-12-25 | 东北大学 | A kind of abnormality recognition method of wind-driven generator |
Non-Patent Citations (4)
Title |
---|
CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests; Li Ma et al.; BMC Bioinformatics; 2017-12-31; pp. 1-18 *
Prediction of Wind Turbine Blades Icing Based on MKB-SMOTE and Random Forest in Imbalanced Data Set; Yangming Ge et al.; 2017 IEEE Conference on Energy Internet and Energy System Integration; 2018-01-04; pp. 1-6 *
Imbalanced data processing; Wuming Xiaozu 917; 360doc Personal Library; 2017-01-10; pp. 1-7 *
After-sales customer segmentation based on semi-supervised spectral clustering ensemble; Yang Jingya et al.; Computer Engineering and Applications; 2019-02-22; Vol. 56, No. 02; pp. 266-271 *
Also Published As
Publication number | Publication date |
---|---|
CN109978039A (en) | 2019-07-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109978039B (en) | Fan blade icing prediction method based on unbalanced data set | |
CN109958588B (en) | Icing prediction method, icing prediction device, storage medium, model generation method and model generation device | |
CN110006649B (en) | Bearing fault diagnosis method based on improved ant lion algorithm and support vector machine | |
CN109751206B (en) | Fan blade icing fault prediction method and device and storage medium | |
JP6759966B2 (en) | How to operate the photovoltaic power generation system | |
Xu et al. | Predicting fan blade icing by using particle swarm optimization and support vector machine algorithm | |
CN106248368B (en) | Combustion engine turbine blade fault detection method based on deep learning | |
CN110750524A (en) | Method and system for determining fault characteristics of active power distribution network | |
CN106779200A (en) | Based on the Wind turbines trend prediction method for carrying out similarity in the historical data | |
Ge et al. | Prediction of wind turbine blades icing based on MBK-SMOTE and random forest in imbalanced data set | |
CN104299044A (en) | Clustering-analysis-based wind power short-term prediction system and prediction method | |
CN107944622A (en) | Wind power forecasting method based on continuous time cluster | |
Gagne et al. | Classification of convective areas using decision trees | |
CN103955521B (en) | Cluster classification method for wind power plant | |
CN111931851B (en) | Fan blade icing fault diagnosis method based on one-dimensional residual neural network | |
CN113689053B (en) | Strong convection weather overhead line power failure prediction method based on random forest | |
CN116050666B (en) | Photovoltaic power generation power prediction method for irradiation characteristic clustering | |
CN112832960A (en) | Fan blade icing detection method based on deep learning and storage medium | |
Li et al. | Prediction of wind turbine blades icing based on CJBM with imbalanced data | |
CN114330881A (en) | Data-driven fan blade icing prediction method and device | |
CN112211794B (en) | Cabin temperature abnormity early warning method, device, equipment and medium of wind turbine generator | |
Ma et al. | Anomaly Detection of Mountain Photovoltaic Power Plant Based on Spectral Clustering | |
CN114994451B (en) | Ship electrical equipment fault detection method and system | |
CN116541780A (en) | Power transmission line galloping early warning method, device, equipment and storage medium | |
CN116663393A (en) | Random forest-based power distribution network continuous high-temperature fault risk level prediction method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||