Disclosure of Invention
To improve the equivalent-model accuracy of a wind farm and its applicability across multiple operating conditions, the invention provides a method that reduces the dimensionality of the clustering indexes using eXtreme Gradient Boosting (XGBoost) and divides the wind turbines into clusters using Density-Based Spatial Clustering of Applications with Noise (DBSCAN) optimized with Dynamic Time Warping (DTW). The method processes the multidimensional time-series operating data of the wind turbines and thereby obtains an accurate and effective cluster division of the wind farm, as described in detail below:
a method for clustering the wind turbines within a wind farm based on extreme gradient dynamic density clustering, the method comprising:
selecting clustering indexes for the wind farm, and performing outlier detection and truncation on the corresponding index data over a given period;
performing dimensionality-reduction selection on the preprocessed clustering index data using XGBoost;
and dividing the selected index data into clusters using DBSCAN-DTW clustering.
Selecting the clustering indexes for the wind farm and performing outlier detection and truncation on the corresponding index data over a given period specifically comprises:
selecting 13 wind farm clustering indexes, namely, for each wind turbine: the rotor angular speed, the pitch angle, the electromagnetic torque, the mechanical torque, the stator voltage, the active power, the reactive power, the rotor voltage d-axis component, the rotor voltage q-axis component, the stator current d-axis component, the stator current q-axis component, the rotor current d-axis component and the rotor current q-axis component;
the lower and upper outlier truncation limits are:
in the formula: min and max denote the lower and upper limits of data truncation, respectively; Q1 and Q3 denote the lower and upper quartiles, respectively; and IQR = Q3 - Q1.
Further, performing dimensionality-reduction selection on the clustering index data using XGBoost specifically comprises:
obtaining the objective function of each boosted tree; after node iteration, obtaining the optimal value of the difference between the loss before and after node splitting; calculating the average gain of each feature to represent its importance; and then selecting the more important features to achieve dimensionality reduction;
each time a feature is used for a split, its gain is recorded; finally, the sum of all gain values of the feature is divided by the number of times the feature is used to split nodes, giving a quantitative score of the feature's contribution;
and deleting the features one by one in ascending order of contribution, re-clustering after each deletion, and, after traversal, obtaining the index selection scheme corresponding to the clustering with the highest silhouette coefficient (contour value).
Dividing the selected index data into clusters using DBSCAN-DTW clustering specifically comprises:
stretching and compressing the two time series, applying constraints and pruning, and searching for the optimal warping path; the similarity between the two time series is measured by the sum of the Euclidean distances between all matched points, which is called the warping path distance.
Further, the method further comprises:
obtaining the optimal clustering by continuously adjusting the parameters; selecting the clustering indexes with XGBoost to achieve feature dimensionality reduction; performing DBSCAN-DTW clustering again; and outputting the clustering result.
The beneficial effects of the technical scheme provided by the invention are:
1. The dynamic response times of doubly-fed induction generators (DFIGs) of the same model in a wind farm differ at different wind speeds, and the differences grow when different electrical distances, fault locations, or fault types are considered. Because conventional measures such as the Euclidean distance and the cosine distance presuppose that the time series are aligned point by point, a clustering algorithm that uses them cannot cluster the DFIGs effectively. The method uses the DTW algorithm to calculate the warping path and resolves this problem through time-series warping; using the Euclidean distance along the warping path as the similarity index between wind turbines is more reliable, can effectively cope with partially missing data from an actual wind farm, and improves the model accuracy;
2. Because the distribution of DFIGs in the multidimensional feature space is random, similar DFIGs usually form irregularly shaped clusters, whereas conventional distance-based clustering algorithms can only form spherical clusters; similar DFIGs may therefore be treated as outliers during clustering, which increases the model complexity and lowers the silhouette value. In addition, the multidimensional feature data of a wind farm contain considerable noise during steady operation, fault occurrence, fault recovery, and similar periods, which lowers the accuracy of conventional clustering methods. The density-based DBSCAN algorithm can obtain clusters of different shapes and has low sensitivity to noise, so it can eliminate the influence of noise, resolve the above problems, and improve the simplicity and accuracy of the model;
3. The operating characteristics of a wind turbine are high-dimensional. To capture the characteristics of the DFIG comprehensively, both electrical and mechanical indexes are used, so the dimensionality is high; strong correlation and noise may exist among the indexes, which reduces the clustering speed and accuracy, and manual index selection is highly subjective. The method uses XGBoost to select the clustering indexes, which eliminates the strong correlation among variables, filters out noise and redundant data, and speeds up the subsequent clustering. It also avoids the problems that principal component analysis is unsuitable for non-Gaussian data and that principal components are not interpretable, and its dimensionality reduction is markedly faster and more accurate than that of the conventional random forest algorithm, so the model is simplified more quickly, accurately, and effectively.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
Example 1
To establish a wind farm equivalent model that accurately represents the operating characteristics, and referring to FIG. 1, an embodiment of the present invention provides a wind farm clustering method based on extreme gradient dynamic density clustering, as described in detail below:
Step 101: selecting clustering indexes for the wind farm, and performing outlier detection and outlier truncation on the corresponding index data over a given period;
Step 102: performing dimensionality-reduction selection on the preprocessed clustering index data using XGBoost;
Step 103: dividing the selected index data into clusters using the DBSCAN-DTW clustering method.
In summary, the embodiment of the present invention can process the multi-dimensional time sequence characteristic operation data of the wind turbine based on the steps 101 to 103, so as to obtain an accurate and effective wind farm cluster division scheme.
Example 2
The scheme of embodiment 1 is further described below with reference to FIGS. 1 to 5, specific calculation formulas, and examples:
Step 201: detecting and processing outliers in the data;
13 wind farm clustering indexes are selected: for each wind turbine, 4 mechanical indexes (rotor angular speed wr, pitch angle Pitch, electromagnetic torque Tem, and mechanical torque Tm) and 9 electrical indexes (stator voltage Vs, active power P, reactive power Q, rotor voltage d-axis component Vrd, rotor voltage q-axis component Vrq, stator current d-axis component Isd, stator current q-axis component Isq, rotor current d-axis component Ird, and rotor current q-axis component Irq). Owing to practical engineering conditions such as measurement errors, the initial wind farm data set contains many outliers, which may shift the subsequent clustering result as a whole. Data cleaning can be completed by graphical detection, i.e. by drawing a box plot of each variable, and the lower and upper outlier truncation limits are determined by formula (1).
In the formula: min and max denote the lower and upper limits of data truncation, respectively; Q1 and Q3 denote the lower and upper quartiles, respectively (terms from mathematical statistics); and IQR = Q3 - Q1.
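The quartile-based truncation described above can be sketched as follows. This is only an illustration: the whisker factor k = 1.5 is the conventional box-plot value and is an assumption here, as is the choice to clip outliers to the limits rather than delete them; the function names are hypothetical.

```python
from statistics import quantiles


def truncation_limits(values, k=1.5):
    """Return (lower, upper) truncation limits computed from the quartiles.

    k = 1.5 is the conventional box-plot factor (an assumption, not taken
    from the source text).
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def truncate_outliers(values, k=1.5):
    """Clip every value to the [lower, upper] interval (assumed behaviour)."""
    lower, upper = truncation_limits(values, k)
    return [min(max(v, lower), upper) for v in values]
```

For one index of one wind turbine, `truncate_outliers(series)` would be applied to the sampled series before any clustering index is computed from it.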
Step 202: extracting the wind farm features and reducing their dimensionality with XGBoost;
The XGBoost used in this embodiment is an ensemble algorithm based on regression tree models. It implements the Gradient Boosting Decision Tree (GBDT) algorithm while further optimizing space and time for engineering problems: for example, nodes at the same level can be processed in parallel, and with cross-validation tree building can be stopped early once the prediction is good enough; a regularization term is added to the objective function, which reduces model complexity, improves generalization, and effectively prevents overfitting; and a Taylor expansion of the loss function is used, so that the approximation markedly speeds up the computation and allows the loss function to be reduced further.
The overall structure of the algorithm is given by formula (2):
ŷ_i = φ(x_i) = Σ_{k=1}^{K} f_k(x_i) (2)
where i denotes the i-th sample; x_i is the input sample data; ŷ_i is the estimated value; φ(x_i) is the mapping from the sample data x_i to the estimated value ŷ_i; and f_k is the function obtained by training the k-th tree. The overall mapping is the sum of the functions generated by the K trees.
The objective function is expressed as:
where L is the training error, i.e. the loss function of the predicted and true values, which represents how well the model fits the training set; Ω(f_t) is the regularization term, which represents the complexity of the model (the more complex the model, the larger its value); c is a constant term; and y_i is the true value.
For the loss function, regression problems are measured by the mean square error (MSE) and classification problems by the cross entropy (which measures the dissimilarity between two probability distributions); the loss function is:
where n is the number of sample points.
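The two loss measures named above can be illustrated with the following minimal sketch. It uses the textbook definitions of the mean square error and the binary cross entropy, which are assumptions here because the formula of this embodiment is not reproduced; the function names and the clipping constant eps are hypothetical.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences over the n sample points."""
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average cross entropy between 0/1 labels and predicted probabilities."""
    n = len(y_true)
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / n
```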
XGBoost adds a regularization term to the objective function to avoid overfitting; the regularization term is defined through the L1 norm or the L2 norm of the leaf-node parameters of the newly built tree:
L1: Ω(w) = λ||w||_1 (5)
L2: Ω(w) = λ||w||_2 (6)
where w is the set of leaf-node scores; λ keeps the leaf-node scores from becoming too large, i.e. prevents overfitting; and Ω(w) is the norm-based regularization term.
The embodiment of the invention adopts an L2 norm to define a regular term:
in the formula, W is the number of leaf nodes of each tree, and μ and λ are both manually set hyperparameters.
For the loss function, the gradient boosting tree uses an approximation: the residual is represented by the negative gradient. The gradient at a point gives the direction along which the directional derivative of the function is largest, i.e. the direction in which the function changes most rapidly at that point. Each round of training adds one function whose purpose is to reduce the residual. To reduce it as quickly as possible, each new model is built along the gradient direction in which the residual decreases.
Suppose the model is to be trained for t rounds; the final function is determined as follows:
where f_t(x_i) is the function obtained by training the t-th tree, evaluated at the i-th sample.
In each new round a new function f_t(x_i) is added so as to reduce the objective function as much as possible; in the t-th round, the objective function is:
Approximating with a Taylor expansion gives:
where f(x) corresponds to the function added in the new round; f'(x) is the first derivative of f(x); g_i is the first derivative of the training error; f''(x) is the second derivative of f(x); h_i is the second derivative of the training error; and the remaining term is the training error of the i-th sample at the (t-1)-th tree.
Define the leaf-node average gain G_j and the leaf-node average gain trend H_j:
where I_j is the sample space corresponding to the j-th leaf node.
The new objective function is obtained as:
where Y^(t) is the objective function of each boosted tree; θ is a hyperparameter of the leaf-node prediction score; and m_j is the prediction score of the j-th leaf node.
The above is the objective function of step t, i.e. the function minimized during the t-th round of training. The approximation markedly speeds up the computation, so more rounds can be run and the loss function can be reduced further.
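The quantities g_i, h_i, G_j and H_j and the resulting leaf scores can be illustrated with the sketch below. It assumes the squared-error loss L = (y - ŷ)²/2 and the closed-form expressions of the commonly published XGBoost derivation (G_j and H_j as sums of g_i and h_i over the samples of leaf j, leaf score -G_j/(H_j + λ)); these are assumptions, since the formulas of this embodiment are not reproduced, and the hyperparameter θ is not modelled. All function names are hypothetical.

```python
def grad_hess_squared_error(y_true, y_pred):
    """First and second derivatives of L = (y - y_hat)^2 / 2 w.r.t. y_hat."""
    g = [p - t for t, p in zip(y_true, y_pred)]
    h = [1.0 for _ in y_true]
    return g, h


def leaf_statistics(g, h, leaf_of_sample, num_leaves):
    """Sum g_i and h_i over the samples I_j that fall into each leaf j."""
    G = [0.0] * num_leaves
    H = [0.0] * num_leaves
    for gi, hi, j in zip(g, h, leaf_of_sample):
        G[j] += gi
        H[j] += hi
    return G, H


def optimal_leaf_scores(G, H, lam):
    """Leaf scores that minimise the second-order objective (assumed form)."""
    return [-Gj / (Hj + lam) for Gj, Hj in zip(G, H)]


def structure_objective(G, H, lam, mu):
    """Objective value of a fixed tree structure; mu penalises each leaf."""
    return -0.5 * sum(Gj * Gj / (Hj + lam) for Gj, Hj in zip(G, H)) + mu * len(G)
```

A lower `structure_objective` value indicates a better tree structure under these assumptions, which is what the split search in the following steps tries to achieve.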
The boosted tree is generated in the following steps:
Step 1: for the first tree, traverse every value of every feature of the wind turbine samples; each feature value splits the original set of wind turbine samples into two parts, and the mean square error of each of the two sets is calculated. Among all feature values of the wind turbine samples, find the one that minimizes the sum of the mean square errors of the left and right sets (the sets are also called nodes), record the name of that feature and the corresponding feature value, and split the tree into left and right nodes accordingly.
Step 2: repeat the above step until the average gain G at every possible split point is negative, at which point tree construction ends.
Step 3: when no further splitting is performed, the nodes of the last layer are called the leaf nodes of the tree (each node is a set), and the mean of the current predicted values of the samples falling in a leaf node is the result for that node.
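The exhaustive split search of Step 1 can be sketched as follows. This is a minimal illustration under the assumption that samples are given as lists of feature values and that a split sends samples with a value less than or equal to the threshold to the left node; the function names are hypothetical.

```python
def node_mse(targets):
    """Mean square error of a node around its own mean."""
    mean = sum(targets) / len(targets)
    return sum((t - mean) ** 2 for t in targets) / len(targets)


def best_split(samples, targets):
    """Return (feature_index, threshold, error) minimising left MSE + right MSE.

    samples: one list of feature values per wind turbine sample.
    """
    best = None
    for f in range(len(samples[0])):
        for threshold in sorted({row[f] for row in samples}):
            left = [t for row, t in zip(samples, targets) if row[f] <= threshold]
            right = [t for row, t in zip(samples, targets) if row[f] > threshold]
            if not left or not right:
                continue  # a split must leave samples on both sides
            error = node_mse(left) + node_mse(right)
            if best is None or error < best[2]:
                best = (f, threshold, error)
    return best
```

Applying `best_split` recursively to the left and right sample sets grows the tree level by level, as in Step 2.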
A general regression tree uses the squared-error loss; under the forward stagewise approach, it is only necessary to optimize at each step in order to optimize the whole. Owing to the particular form of the squared error, it can be deduced that each step only needs to fit the residual (true value minus predicted value); the input used to generate the second tree is therefore the residual, and the tree is generated on the same principle as the previous one.
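The residual-fitting principle can be illustrated with one-split regression stumps on a single feature. This is a simplified sketch, not the XGBoost procedure itself: the stump learner, the number of rounds, and the shrinkage factor learning_rate are all assumptions introduced for the illustration, and the input is assumed to contain at least two distinct feature values.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump on a single feature."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep the right side non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, rounds=20, learning_rate=0.5):
    """Each round fits a stump to the current residuals (true - predicted)."""
    trees = []
    predictions = [0.0] * len(ys)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        predictions = [p + learning_rate * tree(x) for p, x in zip(predictions, xs)]
    return lambda x: sum(learning_rate * tree(x) for tree in trees)
```

Each added stump shrinks the remaining residual, so the summed prediction approaches the true values as the number of rounds grows.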
The above is the method for generating the tree; the generated trees are subsequently used to select the wind turbine features.
Using the above algorithm, the specific steps of wind turbine feature selection are as follows:
1) inputting the wind turbine samples x and the loss function L;
2) building a tree with a greedy algorithm, learning a new function, and fitting the residual of the previous prediction;
3) iteratively training on the loss function L so that the error becomes as small as possible;
4) defining the regularization term, calculating the complexity, and dividing the generated tree into a structure s and weights m.
The objective function Y^(t) is transformed to obtain:
5) After node iteration, the optimal value of the difference between the loss before and after node splitting (i.e. the gain of the node split) is obtained; it can be used to calculate the average gain G of a feature, which represents the importance of the feature, and the more important features are then selected to achieve dimensionality reduction:
where the three score terms represent, respectively, the score of the left subtree, the score of the right subtree, and the score obtained without splitting, and μ represents the complexity of the new node.
6) Repeating the above steps until enough trees have been generated that the predicted value is as close as possible to the true value, at which point the algorithm ends.
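The gain of a node split in step 5) can be illustrated with the expression used in the commonly published XGBoost derivation: left score plus right score minus the unsplit score, less a complexity cost for the new node. That form, and the mapping of the arguments lam and mu onto the hyperparameters λ and μ, are assumptions here because the formula of this embodiment is not reproduced; the function name is hypothetical.

```python
def split_gain(G_left, H_left, G_right, H_right, lam, mu):
    """Gain of splitting a node into a left and a right child (assumed form).

    G_* and H_* are the sums of first and second derivatives in each child;
    lam is the L2 regularisation weight and mu the cost of the new node.
    """
    def score(G, H):
        return G * G / (H + lam)

    unsplit = score(G_left + G_right, H_left + H_right)
    return 0.5 * (score(G_left, H_left) + score(G_right, H_right) - unsplit) - mu
```

Under this form a negative return value means the split does not pay for the added complexity, which matches the stopping condition of the tree generation steps.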
The embodiment of the invention applies this algorithm to the dimensionality reduction of the clustering indexes. The larger the gain, the larger the difference between the models before and after the split, i.e. the more the objective function decreases and the closer it comes to its optimal solution. Therefore, each time a feature is used for a split, its G value is recorded, and finally the quantitative score of the feature's contribution is obtained by dividing the total G value of the feature by the number of times the feature is used to split nodes (i.e. sum of G values / number of splits). The features are deleted one by one in ascending order of contribution and the data are re-clustered after each deletion; after traversal, the index selection scheme corresponding to the clustering with the highest silhouette coefficient (contour value) is obtained. The silhouette coefficient is a common index for evaluating clustering quality that is well known to those skilled in the art and is not described in detail in the embodiments of the present invention.
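The scoring and elimination procedure above can be sketched as follows. The sketch assumes that the gain of every split has already been recorded as (feature name, gain) pairs and that the caller supplies a hypothetical function evaluate(features) which re-clusters with the given features and returns the silhouette (contour) value; both function names are illustrative.

```python
from collections import defaultdict


def average_gain_scores(split_records):
    """Total gain of each feature divided by the number of splits using it."""
    total = defaultdict(float)
    count = defaultdict(int)
    for feature, gain in split_records:
        total[feature] += gain
        count[feature] += 1
    return {feature: total[feature] / count[feature] for feature in total}


def select_indexes(scores, evaluate):
    """Delete features from lowest to highest score; keep the best subset.

    evaluate(features) is a caller-supplied (hypothetical) function that
    re-clusters with the given features and returns the silhouette value.
    """
    remaining = sorted(scores, key=scores.get, reverse=True)
    best_features, best_value = list(remaining), evaluate(remaining)
    while len(remaining) > 1:
        remaining = remaining[:-1]  # drop the lowest-scoring feature
        value = evaluate(remaining)
        if value > best_value:
            best_features, best_value = list(remaining), value
    return best_features, best_value
```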
Step 203: dividing the clusters with the DBSCAN-DTW clustering method.
For any kind of time-series feature data, define two sequences P and Q of lengths m and n, respectively, written as P = (p_1, p_2, …, p_m) and Q = (q_1, q_2, …, q_n). When m = n, the point-to-point distance can be calculated with the Euclidean distance:
Owing to wind direction, terrain, wake, and other effects, the units in the same wind farm experience different wind speeds, so even wind turbines of the same model have different dynamic response times. Owing to the structure of the collector network inside the wind farm, different wind turbines are affected differently during a fault; for example, when a fault occurs at the wind farm outlet, some wind turbines may go through low-voltage ride-through while others are unaffected. Under these complex conditions the operating data of different wind turbines are not aligned in time, so the Euclidean distance cannot serve as an effective measure of the similarity between two time series in the DBSCAN clustering algorithm. The method therefore uses DTW to calculate the similarity of the time-series data in DBSCAN: the time series are stretched and compressed, constraints and pruning are applied, and the optimal warping path is searched for; the similarity between the two time series is measured by the sum of the Euclidean distances between all matched points, called the Warping Path Distance (WPD).
DTW is an important method for measuring the similarity of two sequences of different lengths. Its core is that it overcomes the restriction of unequal sequence lengths: by stretching and compressing the sequences, points on different sequences are aligned with one another, and the minimum cumulative distance between the points of the two sequences is calculated as follows:
where D(i, j) is the distance between the i-th point of sequence P and the j-th point of sequence Q.
To align the two sequences, an m × n distance matrix grid is constructed, in which each element (i, j) represents the distance d(p_i, q_j) between the two points; the smaller the distance, the higher the similarity. Suppose sequence P is (2, 3, 6, 2, 1) and sequence Q is (1, 3, 6, 4). The distance between them is 1 and the distance matrix grid is shown in FIG. 2.
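The distance matrix grid for the two example sequences can be built as in the following sketch, which assumes that the point-to-point distance of scalar samples is the absolute difference (the one-dimensional Euclidean distance); the function name is hypothetical.

```python
def distance_matrix(p, q):
    """m x n grid whose element (i, j) is the distance between p[i] and q[j]."""
    return [[abs(pi - qj) for qj in q] for pi in p]


P = [2, 3, 6, 2, 1]
Q = [1, 3, 6, 4]
grid = distance_matrix(P, Q)  # 5 rows (points of P) by 4 columns (points of Q)
```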
Let the warping path of the sequences be R and let K denote the length of the path; the warping path is then R = (r_1, r_2, …, r_K), and the warping path distance function is:
The defined warping path needs to satisfy certain constraints:
1) Boundary condition: the start and end points of the two sequences P and Q must correspond, i.e. r_1 = (1, 1) and r_K = (m, n).
2) Monotonicity: the points on the sequences P and Q must be matched in monotonic order, so that the matches between the two sequences do not cross.
3) Continuity: a point in a sequence can only be matched with adjacent points and cannot skip over points, i.e. 0 ≤ i - i' ≤ 1 for consecutive path points.
Once the above constraints are satisfied, the warping path and the minimum cumulative distance can be calculated, as shown in FIG. 3.
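A minimal dynamic-programming sketch of the warping path search under the three constraints above is given below. It assumes scalar samples with the absolute difference as point distance and omits any windowing or pruning; the function name and the 1-based path indices are choices made for the illustration.

```python
def dtw(p, q):
    """Return (warping path distance, warping path) for two 1-D sequences.

    Path points are 1-based (i, j) pairs, starting at (1, 1) and ending
    at (m, n); each step advances i, j, or both by exactly one.
    """
    m, n = len(p), len(q)
    inf = float("inf")
    acc = [[inf] * (n + 1) for _ in range(m + 1)]
    acc[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(p[i - 1] - q[j - 1])
            acc[i][j] = cost + min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
    # Backtrack from (m, n) to (1, 1) along the cheapest predecessors.
    i, j = m, n
    path = [(i, j)]
    while (i, j) != (1, 1):
        _, i, j = min((acc[i - 1][j - 1], i - 1, j - 1),
                      (acc[i - 1][j], i - 1, j),
                      (acc[i][j - 1], i, j - 1))
        path.append((i, j))
    path.reverse()
    return acc[m][n], path
```

For the example sequences P = (2, 3, 6, 2, 1) and Q = (1, 3, 6, 4), this sketch gives a cumulative distance of 6 along the path (1,1), (2,2), (3,3), (4,4), (5,4) under the stated assumptions.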
The similarity between the multidimensional time series obtained through DTW is then substituted into the DBSCAN clustering algorithm. DBSCAN divides regions of sufficient density into clusters and finds clusters of arbitrary shape in a noisy spatial database; it defines a cluster as the largest set of density-connected points. Unlike distance-based clustering algorithms, a density-based clustering algorithm can find clusters of arbitrary shape: it looks for high-density regions in the data set that are separated by low-density regions, and treats each separated high-density region as an independent category. DBSCAN is one of the more flexible clustering algorithms; it is not restricted by the number of samples and is applicable to both dense and non-dense data. It judges how close samples are on the basis of density, and groups the samples that meet the requirement into one category, i.e. a cluster. The DBSCAN algorithm has two important parameters that must be set manually, Epsilon and MinPts, where Epsilon is the clustering radius and MinPts is the minimum number of samples in a category.
The parameters Epsilon and MinPts divide all samples into three categories. Core point: for a point M in the samples, let N be the number of samples that are density-reachable from it; when N ≥ MinPts, M is a core point. Boundary point: for a point P in the samples, let r be the distance from P to a core point M; when r = Epsilon, P is a boundary point. Noise point: a point Q in the samples that is not density-reachable from any core point is a noise point, as shown in FIG. 4. Once the two parameters are set, a point in the samples is selected as a core point, and all samples that are density-reachable from it are found and assigned to one category, ensuring that all points of the category lie within the Epsilon neighborhood. Let the sample set be E; for a point m:
Epsilon(m,M)={m∈E|d(m,M)≤Epsilon} (20)
where Epsilon(m, M) is the set of data points within the clustering radius of the core point M.
More core points are found in the same way; the algorithm flow is shown in FIG. 1, and the algorithm terminates once every core object has been assigned a category.
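A compact sketch of DBSCAN operating on a precomputed distance matrix (for example, pairwise DTW distances between wind turbines) is given below. It follows the textbook formulation, in which a point is a core point when its Epsilon neighborhood, counting the point itself, holds at least MinPts samples; that counting convention, the label -1 for noise, and the function name are assumptions of the illustration.

```python
def dbscan(dist, eps, min_pts):
    """Cluster n samples from an n x n distance matrix; -1 marks noise."""
    n = len(dist)
    labels = [None] * n
    cluster = -1

    def neighbours(i):
        # Epsilon neighbourhood of sample i (includes i itself).
        return [j for j in range(n) if dist[i][j] <= eps]

    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # noise for now; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
    return labels
```

Because only the distance matrix is needed, any similarity measure can be plugged in without changing the clustering routine.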
Using DTW, the optimal warping path between wind turbines is calculated from the acquired turbine data, and the similarity is calculated as the Euclidean distance along the warping path; this similarity serves as the clustering basis for DBSCAN, and the optimal grouping is obtained by continuously adjusting the parameters. At this point the computational dimensionality is very high, and the features may be strongly correlated and redundant; XGBoost is therefore used to select the clustering indexes and reduce the feature dimensionality, DBSCAN-DTW clustering is performed again, and the clustering result is output.
Example 3
The feasibility of the schemes of embodiments 1 and 2 is verified below with reference to specific tests, calculations, and Tables 1 to 3. In this embodiment, the MATLAB/Simulink simulation platform is used to build a wind farm consisting of 16 DFIGs, each rated at 1.5 MW, as shown in FIG. 6. The 690 V terminal voltage of each DFIG is stepped up on site to 35 kV in a one-generator-one-transformer unit connection, transmitted over an overhead line to a 35 kV/220 kV substation, and delivered to the external grid. The initial wind speed data of the wind turbines are shown in Table 1.
TABLE 1 initial wind speed
In terms of software configuration, this embodiment is implemented in Python, using the sklearn machine learning library, the Vim editor, and the Anaconda environment management software.
A three-phase short-circuit fault is set at the wind farm outlet during a certain period, and the 13-dimensional time-series feature data of each wind turbine are collected in both the transient and the steady state. The 13 features are, for each wind turbine, 4 mechanical indexes (rotor angular speed wr, pitch angle Pitch, electromagnetic torque Tem, and mechanical torque Tm) and 9 electrical indexes (stator voltage Vs, active power P, reactive power Q, rotor voltage d-axis component Vrd, rotor voltage q-axis component Vrq, stator current d-axis component Isd, stator current q-axis component Isq, rotor current d-axis component Ird, and rotor current q-axis component Irq). XGBoost and the random forest algorithm are each used to reduce the dimensionality of the collected 13-dimensional DFIG time-series features, and the resulting rankings of index contribution are shown in Table 2.
TABLE 2 index contribution ranking
The XGBoost training time is 3.53 s and the random forest training time is 52.85 s; XGBoost trains far faster than the random forest, which verifies its training speed.
After dimensionality reduction with XGBoost, the scheme with the highest silhouette coefficient (contour value) scores 0.954, and the corresponding clustering indexes are 3-dimensional: Tem, Vrq, Ird. After dimensionality reduction with the random forest, the scheme with the highest silhouette value scores 0.897, and the corresponding clustering indexes are also 3-dimensional: Tem, Ird, Irq. The former value is clearly higher than the latter. In this embodiment the silhouette value of an outlier is set to 0, so the silhouette index is an optimization target that jointly rewards few outliers and high within-cluster similarity of the DFIGs; this shows that XGBoost selects the clustering indexes more accurately.
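The treatment of outliers in the silhouette (contour) value can be illustrated with the sketch below. It assumes the standard silhouette definition computed from a precomputed distance matrix, that noise samples carry the label -1 and contribute 0, and that the mean is taken over all samples including the outliers; this reading of the embodiment and the function name are assumptions.

```python
def silhouette_with_noise(dist, labels, noise_label=-1):
    """Mean silhouette over all samples; noise samples contribute 0."""
    clusters = {}
    for i, label in enumerate(labels):
        if label != noise_label:
            clusters.setdefault(label, []).append(i)
    total = 0.0
    for i, label in enumerate(labels):
        if label == noise_label or len(clusters) < 2 or len(clusters[label]) == 1:
            continue  # outliers and degenerate cases score 0
        own = clusters[label]
        # a: mean distance to the other members of the same cluster.
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != label)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)
```

With this convention every additional outlier lowers the mean, so a scheme is rewarded both for tight clusters and for leaving few wind turbines unassigned.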
DTW yields the Euclidean distance along the warping paths, and the similarity it represents is used for clustering; the maximum search radius is set to the largest similarity value between any two wind turbines, all points of the high-dimensional space are traversed, and the solution space is screened.
For the wind farm grouped by the clustering algorithm, the equivalent parameters of the DFIGs are calculated by the capacity-weighting method, and the equivalent parameters of the collector network within the wind farm are obtained by the equal-loss principle; the corresponding MATLAB/Simulink simulation model is then built, and its dynamic response at the outlet is compared with that of the original model. To verify the clustering accuracy of DBSCAN-DTW with XGBoost dimensionality reduction, this embodiment also runs the experiment with the K-means clustering algorithm, the DBSCAN-DTW clustering algorithm, and the DBSCAN-DTW clustering algorithm with random forest dimensionality reduction for comparison; the comparison data are shown in Table 3.
TABLE 3 comparison of cluster partitioning results with dynamic equivalent relative deviation
As shown in Table 3, compared with equivalent models clustered by the other methods, the DBSCAN-DTW clustering with XGBoost dimensionality reduction proposed in this embodiment reduces the voltage, active power, and reactive power deviations of the equivalent model and significantly improves its accuracy.
Model evaluation:
1) When a three-phase short-circuit fault occurs at the grid connection point, the dynamic response times of DFIGs of the same model differ at different wind speeds, and the differences grow when different electrical distances, fault locations, or fault types are considered. Conventional measures such as the Euclidean distance and the cosine distance presuppose point-by-point time alignment, so a clustering algorithm that uses them cannot cluster the DFIGs effectively. The method uses the DTW algorithm to calculate the warping path, resolves this problem through time-series warping, and improves the model accuracy.
2) Because the distribution of DFIGs in the multidimensional feature space is random, similar DFIGs usually form irregularly shaped clusters, whereas conventional distance-based clustering algorithms can only form spherical clusters; outliers then shift the cluster centroids during clustering and lower the silhouette value. DBSCAN resolves this problem with density-based clustering and improves the model accuracy.
3) The multidimensional feature data of a wind farm contain considerable noise during steady operation, fault occurrence, fault recovery, and similar periods, which lowers the accuracy of conventional clustering methods. DBSCAN has very low sensitivity to noise and therefore better noise immunity, which improves the model accuracy.
4) To capture the characteristics of the DFIG fully, both electrical and mechanical indexes are used, so the dimensionality is high; the indexes may be strongly correlated and may contain noise, which reduces the clustering speed and accuracy. XGBoost selects the clustering features effectively, speeds up clustering, filters out noise and redundant data, and markedly raises the silhouette value; its dimensionality reduction is faster than that of the random forest and yields a higher silhouette value, so the model is simplified more quickly, accurately, and effectively.
In the embodiment of the present invention, except for the specific description of the model of each device, the model of other devices is not limited, as long as the device can perform the above functions.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.